Read PDF File in Java
Apache PDFBox, iText 5, iText 7
In this tutorial, we’ll learn how to read a PDF file in Java using different libraries.
Overview
Portable Document Format (PDF) is a popular and widely used file format for documents. It is a common choice for electronic distribution (e.g. email attachments) and for print media.
Unlike text files, PDF files are complex to read, and Java doesn’t provide native support for reading them. The good news is that there are many open-source Java libraries available that we can use. In this article, we’ll look at some of the popular libraries for reading a PDF file in Java: Apache PDFBox, iText 5, and iText 7.
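Before handing a file to any of these libraries, it can be useful to confirm that it looks like a PDF at all. A well-formed PDF file starts with the ASCII bytes %PDF- followed by a version number, and that can be checked with the standard library alone. The following is a minimal sketch, not part of any of the libraries covered here; the class name PdfSignature and the method name hasPdfHeader are made up for this illustration, and the check covers only the strict case where the header sits at the very first byte.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of PDFBox or iText.
final class PdfSignature {

    // A well-formed PDF file begins with these five ASCII bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfSignature() {
    }

    // Returns true if the given leading bytes start with the PDF header.
    static boolean hasPdfHeader(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(head, MAGIC.length), MAGIC);
    }

    // Reads only the first few bytes of the file and checks them.
    static boolean hasPdfHeader(Path path) throws IOException {
        byte[] head = new byte[MAGIC.length];
        int read = 0;
        try (InputStream in = Files.newInputStream(path)) {
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    break; // file is shorter than the header
                }
                read += n;
            }
        }
        return hasPdfHeader(Arrays.copyOf(head, read));
    }
}
```

A caller could run this check first and report a clear error for non-PDF input, instead of relying on the parsing exception thrown by the library.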
Apache PDFBox
The Apache PDFBox library allows you to create new PDF documents, extract content from PDFs, fill PDF forms, save a PDF as an image, split and merge PDF files, digitally sign them, print them, and much more.
Apache PDFBox is quite easy to use if you’re doing basic text extraction. Let’s look at an example:
- Let’s first add the pdfbox dependency to the pom.xml:

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>2.0.27</version>
</dependency>

or to the build.gradle:

implementation group: 'org.apache.pdfbox', name: 'pdfbox', version: '2.0.27'
- Let’s look at a simple example of using Apache PDFBox to read text from a PDF file:

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

import java.io.File;
import java.io.IOException;

public class PdfDocumentReader {

    public void readPdfFile(File file) throws IOException {
        // try-with-resources closes the document when the block exits
        try (PDDocument document = PDDocument.load(file)) {
            // Encrypted documents are skipped
            if (!document.isEncrypted()) {
                PDFTextStripper textStripper = new PDFTextStripper();
                String text = textStripper.getText(document);
                System.out.println("Text:" + text);
            }
        }
    }
}
In this example, we create the PDDocument instance in a try-with-resources statement, so the document is closed automatically after the PDF file has been read.
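The auto-closing behaviour comes from the Java language rather than from the PDF library, so it can be shown without any PDF at all. The sketch below uses a made-up TrackedResource class (a stand-in for any AutoCloseable such as a PDF document object) to illustrate that close() runs both when the block finishes normally and when it throws. All names in it are hypothetical.

```java
// Hypothetical demo classes for illustration; no PDF library is involved.
final class TryWithResourcesDemo {

    // Stand-in for a closeable document object; records whether close() ran.
    static final class TrackedResource implements AutoCloseable {
        boolean closed = false;

        String read() {
            if (closed) {
                throw new IllegalStateException("already closed");
            }
            return "content";
        }

        @Override
        public void close() {
            closed = true;
        }
    }

    private TryWithResourcesDemo() {
    }

    // The resource is closed when the try block completes normally.
    static boolean closedAfterNormalUse() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            r.read();
        }
        return resource.closed;
    }

    // The resource is also closed when the try block throws.
    static boolean closedAfterFailure() {
        TrackedResource resource = new TrackedResource();
        try (TrackedResource r = resource) {
            throw new IllegalStateException("simulated parse failure");
        } catch (IllegalStateException expected) {
            // close() has already run by the time this handler executes
        }
        return resource.closed;
    }
}
```

Both methods return true, which is why no explicit close() call or finally block is needed in the reading code.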
Please refer to the Apache PDFBox examples for more use cases.
iText 5
iText provides open-source libraries for reading and writing complex PDF files. The library offers low-level support for reading a PDF file, but it is somewhat complex to understand and work with.
Please note that iText 5 has reached end of life (EOL) and is in maintenance mode: it receives only security-related fixes, so that users who built their solutions on iText 5 can safely continue using it. No new features will be added. For new implementations, iText 7 is recommended.
Let’s explore how to read a PDF file using iText 5:
- Let’s first add the itextpdf dependency to the pom.xml:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itextpdf</artifactId>
    <version>5.5.13.3</version>
</dependency>

or to the build.gradle:

implementation group: 'com.itextpdf', name: 'itextpdf', version: '5.5.13.3'
- Let’s look at a simple example of using iText 5 to read text from a PDF file:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import java.io.IOException;

public class PdfDocumentReader {

    public void readPdfFile(String fileName) throws IOException {
        PdfReader pdfReader = new PdfReader(fileName);
        try {
            int pages = pdfReader.getNumberOfPages();
            StringBuilder text = new StringBuilder();
            // iText page numbers start at 1
            for (int i = 1; i <= pages; i++) {
                text.append(PdfTextExtractor.getTextFromPage(pdfReader, i));
            }
            System.out.println("Text:" + text);
        } finally {
            // Release the underlying file once reading is done
            pdfReader.close();
        }
    }
}

In this example, we create a PdfReader instance to load the PDF file, loop through the pages to extract the content of each page, and close the reader in a finally block once we are done.
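The loop above appends each page’s text directly after the previous one, so the last line of one page runs straight into the first line of the next. If page boundaries matter, the per-page strings can be collected first and joined with a marker. The sketch below is one possible way to do that with plain Java; the class name PageTextJoiner and the marker format are assumptions made for this illustration and are not part of iText.

```java
import java.util.List;

// Hypothetical helper for illustration; not part of iText.
final class PageTextJoiner {

    private PageTextJoiner() {
    }

    // Joins per-page text, putting a numbered marker line before each page.
    static String join(List<String> pageTexts) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pageTexts.size(); i++) {
            // Page numbers are 1-based, matching the extraction loop
            out.append("--- Page ").append(i + 1).append(" ---\n");
            out.append(pageTexts.get(i)).append('\n');
        }
        return out.toString();
    }
}
```

In the extraction loop, each result of getTextFromPage would be added to a List<String> instead of being appended to the StringBuilder, and join would be called once after the loop.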
iText 7
iText 7 is the recommended iText version for reading a PDF file. Let’s explore it:
- Let’s first add the itext7-core dependency to the pom.xml:

<dependency>
    <groupId>com.itextpdf</groupId>
    <artifactId>itext7-core</artifactId>
    <version>7.2.5</version>
    <type>pom</type>
</dependency>

or to the build.gradle:

implementation group: 'com.itextpdf', name: 'itext7-core', version: '7.2.5', ext: 'pom'
- Then let’s write the code that uses iText 7 to read text from a PDF file:

import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfPage;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.canvas.parser.PdfTextExtractor;

import java.io.File;
import java.io.IOException;

public class PdfDocumentReader {

    public void readPdfFile(File file) throws IOException {
        StringBuilder text = new StringBuilder();
        // try-with-resources closes the document when the block exits
        try (PdfDocument document = new PdfDocument(new PdfReader(file))) {
            // iText page numbers start at 1
            for (int i = 1; i <= document.getNumberOfPages(); ++i) {
                PdfPage page = document.getPage(i);
                text.append(PdfTextExtractor.getTextFromPage(page));
            }
        }
        System.out.println("Text:" + text);
    }
}
In this example, we create the PdfDocument instance in a try-with-resources statement, so the document is closed automatically after the PDF file has been read.