How To Validate Text in PDF Files Using Selenium Automation

April 04, 2026 · 9 min read · Tool Comparison

Portable Document Format (PDF) is a file format developed by Adobe in 1992 to present documents, including text formatting and images. Verifying content in PDF files is a major requirement in web applications, particularly where contracts, reports, or documentation are involved. Since Selenium does not natively support reading PDF content, Apache PDFBox is used to extract and validate text from PDFs, both hosted online and downloaded locally.

Overview

Why Validate PDF Content?

  • Apps display PDF documents for official or regulatory content.
  • Verifying the contents of PDF files is essential.
  • Manual checks are often tedious.

Tools & Setup

  • Use Apache PDFBox library with Selenium and Java / TestNG.
  • Add dependencies (PDFBox, FontBox etc.) via Maven or external JARs.

Extracting Text from PDF

  • Use PDFTextStripper to read content. Set start and end pages if needed.
  • For PDFs online: open via URL, stream content, load into PDDocument.
  • For local PDFs: point to the file path or file:// URL and parse similarly.

Validations & Assertions

  • Use test assertions to verify that the expected text is present in the PDF content.
  • Handle multi-page PDFs and choose relevant page ranges.

This article explains how to set up, read, and assert PDF content with Selenium and PDFBox.

Why is verifying PDF file content required?

Almost every organization or business uses PDF files to store official data. Let’s take a very simple use case: most websites have some links which, when clicked, either open the PDF in the browser’s reader mode or download the PDF to the local system, depending on the browser’s settings for handling PDF files.

When it comes to testing these PDF files, you can do that manually by opening the link or opening the PDF file from the local system and verifying whether particular information is available or not. However, verifying the content of PDF files at scale becomes cumbersome; hence, automation is a must.

What is Apache PDFBox?

Selenium does not have any inbuilt functionality to test the content of PDF files; hence it needs to use the third-party library Apache PDFBox.

It is an open-source Java tool and can be used with Selenium and TestNG to assert the content of a PDF. Apache PDFBox permits the creation of new PDF documents, manipulation of existing documents, and the ability to extract content from documents.

This article explores content extraction from PDFs using Apache PDFBox.

How to integrate PDFBox with Selenium and Java

The Apache PDFBox library can be downloaded and added as an external library in Eclipse or any other editor of your choice. It can also be added as a Maven dependency in pom.xml.

Downloading JARs and adding them as external JARs:

Note: 2.0.26 is the latest version at the time of writing. In the future, you may navigate to https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/ and download the latest version.

  • Download Apache FontBox JAR
  • Add these downloaded JARs using the below steps
    • Right-click on the created project in Eclipse -> Select Build Path -> Configure Build Path.
    • Click on the Libraries tab and click on “Add External JARs”.
    • Select the downloaded JARs and finally click on the Apply and Close button.

Adding as a Maven dependency:

Step 1: Create a Maven project in Eclipse or any Java editor by selecting the archetype “maven-archetype-quickstart” and add the Selenium Java and TestNG dependencies in pom.xml as seen below.

Step 2: Copy the latest PDFBox dependency from https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox and add it under the <dependencies> tag in pom.xml.

<dependencies>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.3.0</version>
    </dependency>
    <dependency>
        <groupId>io.github.bonigarcia</groupId>
        <artifactId>webdrivermanager</artifactId>
        <version>5.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.26</version>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>7.6.1</version>
    </dependency>
</dependencies>

Step 3: Save the pom.xml file to download all the dependencies from the Maven repository (make sure to have an active internet connection while doing this). Just to be doubly sure, expand the Maven Dependencies folder and verify that the required JARs are downloaded.

Now that the JARs are configured, let us start using Java and TestNG to extract content from a PDF file.

How to read content from PDF file using Apache PDFBox

The PDFTextStripper class of the PDFBox library is used to extract (strip out) the text from a PDF file, as seen in the command below

String pdfContent = new PDFTextStripper().getText(doc);

To get the “doc” reference that is passed to the getText method, you need to write the below lines of code, which we will walk through step by step.

  • Let us take the example of the Google Cloud Security PDF document. Store the PDF URL in a String object using the below command
String pdfUrl = "https://cloud.google.com/docs/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf";
  • Create a URL class object from the java.net package and pass pdfUrl as a parameter.
URL url = new URL(pdfUrl);
  • Use the openStream() method of the URL class to open a connection to this URL, which returns an InputStream for reading from that connection.
InputStream is = url.openStream();
  • After this, create an object of the BufferedInputStream class and pass the InputStream object as a parameter.
BufferedInputStream bis = new BufferedInputStream(is);
  • And at last, use the PDDocument class to represent the PDF. The load() method of PDDocument takes the InputStream object as a parameter, parses the PDF, and returns a PDDocument object.
PDDocument doc = PDDocument.load(bis);
  • Now we are good to use “doc” as the argument to PDFTextStripper().getText()
String pdfContent = new PDFTextStripper().getText(doc);

The getText() method of PDFTextStripper extracts the text of the document passed as a parameter and returns it as a String value.

  • Lastly, don’t forget to close the document as per the PDDocument guideline.
doc.close();
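A URL that is expected to serve a PDF can sometimes return something else, such as an HTML error or login page, and the parsing step may then fail with an error that does not point at the real cause. The following is a minimal sketch of a pre-check, assuming the file begins with the standard %PDF- header bytes. The class PdfHeaderCheck and the method looksLikePdf are hypothetical helpers introduced here, not part of PDFBox or Selenium. The sketch relies on the mark/reset support of BufferedInputStream, so the same stream can still be handed to PDDocument.load() afterwards.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: not part of PDFBox or Selenium.
class PdfHeaderCheck {

    // Assumption: the file starts with the standard PDF header bytes.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /**
     * Peeks at the first bytes of the stream and rewinds it, so the caller
     * can still pass the same stream to a PDF parser afterwards.
     */
    static boolean looksLikePdf(BufferedInputStream in) throws IOException {
        in.mark(MAGIC.length);
        byte[] head = new byte[MAGIC.length];
        int read = 0;
        while (read < head.length) {
            int n = in.read(head, read, head.length - read);
            if (n < 0) {
                break; // stream ended before the header was complete
            }
            read += n;
        }
        in.reset(); // rewind to the marked position
        return read == MAGIC.length && Arrays.equals(head, MAGIC);
    }
}
```

In the steps above, a test could call PdfHeaderCheck.looksLikePdf(bis) right before PDDocument.load(bis) and fail with a clear message when it returns false.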

How to validate contents of a PDF file hosted on the web

Directly navigate to the desired PDF file hosted on the web using a link and verify the content as explained in the previous section. An example ReadPDF test class, combining the above steps in a single code snippet, is seen below:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.Assert;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

import io.github.bonigarcia.wdm.WebDriverManager;

public class ReadPDF {
    WebDriver driver;
    String pdfUrl = "https://cloud.google.com/docs/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf";

    @BeforeTest
    public void setup() {
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get(pdfUrl);
    }

    // An IOException fails the test instead of being swallowed in a catch block
    @Test
    public void verifyTextFromPDF() throws IOException {
        String pdfContent = getPdfContent(pdfUrl);
        Assert.assertTrue(pdfContent.contains("Secure low-level infrastructure"));
        Assert.assertTrue(pdfContent.contains("Security of physical premises"));
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        InputStream is = pdfURL.openStream();
        BufferedInputStream bis = new BufferedInputStream(is);
        PDDocument doc = PDDocument.load(bis);
        int pages = doc.getNumberOfPages();
        System.out.println("The total number of pages " + pages);
        PDFTextStripper strip = new PDFTextStripper();
        // Extract only pages 1 and 2
        strip.setStartPage(1);
        strip.setEndPage(2);
        String stripText = strip.getText(doc);
        System.out.println(stripText);
        doc.close();
        return stripText;
    }
}

How to assert PDF Text

Now that we have received all the content from the PDF file, which is stored in the String object pdfContent, let us see how to assert whether the expected text is present in the pdfContent String. You can use TestNG assertions like the one below to assert that a given text is present in the PDF.

Assert.assertTrue(pdfContent.contains("Google Infrastructure Security"));
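Extracted text typically keeps the line breaks of the PDF layout, so a phrase that wraps onto a second line may not match a plain contains() check. The following is a minimal sketch of one way to make the check more tolerant, by collapsing whitespace on both sides before comparing. PdfTextMatcher, normalize, and containsPhrase are hypothetical names introduced here for illustration; they are not part of PDFBox or TestNG.

```java
// Hypothetical helper for whitespace-tolerant matching of extracted PDF text.
class PdfTextMatcher {

    // Collapses every run of whitespace (spaces, tabs, line breaks) into one space.
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // True if the phrase occurs in the content once both are normalized.
    static boolean containsPhrase(String pdfContent, String phrase) {
        return normalize(pdfContent).contains(normalize(phrase));
    }
}
```

A test could then assert on PdfTextMatcher.containsPhrase(pdfContent, "Google Infrastructure Security") instead of calling contains() directly on the raw text.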

You can do multiple assertions as per your requirement. You may also use TestNG SoftAssert, as it does not throw an exception when the first assertion fails (in case of multiple asserts); instead, it records all failures and throws the exception at the end, when the assertAll() method of the SoftAssert class is called.
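The idea behind soft assertions, checking every expectation and reporting all failures together, can be sketched without TestNG. The following illustrates the pattern only and is not SoftAssert itself; MissingTextCollector and findMissing are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: illustrates the "collect all failures" pattern.
class MissingTextCollector {

    // Checks every expected phrase and returns the ones that are absent,
    // instead of stopping at the first mismatch.
    static List<String> findMissing(String pdfContent, List<String> expectedPhrases) {
        List<String> missing = new ArrayList<>();
        for (String phrase : expectedPhrases) {
            if (!pdfContent.contains(phrase)) {
                missing.add(phrase);
            }
        }
        return missing;
    }
}
```

A test could then make a single assertion on the result, for example Assert.assertTrue(missing.isEmpty(), "Missing: " + missing), so that one run reports every phrase that was not found.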

How to set the start and end page of PDF for extraction?

PDFBox can parse large PDF files as well. For example, if you are testing a PDF file of 40 pages and you are interested in parsing only a few pages, you can achieve that with the setStartPage(int startPageValue) and setEndPage(int endPageValue) methods of the PDFTextStripper class.

strip.setStartPage(x);
strip.setEndPage(y);

Here x and y are the starting and ending page numbers of the PDF that you want to extract.
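PDFTextStripper page numbers start at 1, and the end page is included in the extraction. The following is a minimal sketch of a guard that keeps a requested range inside the document before it is passed to setStartPage and setEndPage. PageRange and clamp are hypothetical names, and clamping an out-of-range request (rather than failing the test) is an assumption about the desired behaviour.

```java
// Hypothetical helper: keeps a 1-based, inclusive page range within the document.
class PageRange {
    final int start;
    final int end;

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    // totalPages would come from doc.getNumberOfPages().
    static PageRange clamp(int requestedStart, int requestedEnd, int totalPages) {
        if (totalPages < 1) {
            throw new IllegalArgumentException("Document has no pages");
        }
        // Pull the start into [1, totalPages], then the end into [start, totalPages].
        int start = Math.max(1, Math.min(requestedStart, totalPages));
        int end = Math.max(start, Math.min(requestedEnd, totalPages));
        return new PageRange(start, end);
    }
}
```

With a range such as PageRange r = PageRange.clamp(1, 2, doc.getNumberOfPages()), the values r.start and r.end can be passed to strip.setStartPage and strip.setEndPage.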

Code to read a PDF from a given start page to the end page

public static String getPdfContent(String url) throws IOException {
    URL pdfURL = new URL(url);
    InputStream is = pdfURL.openStream();
    BufferedInputStream bis = new BufferedInputStream(is);
    PDDocument doc = PDDocument.load(bis);
    int pages = doc.getNumberOfPages();
    System.out.println("The total number of pages " + pages);
    PDFTextStripper strip = new PDFTextStripper();
    // Extract only pages 1 and 2
    strip.setStartPage(1);
    strip.setEndPage(2);
    String stripText = strip.getText(doc);
    System.out.println(stripText);
    doc.close();
    return stripText;
}

How to validate contents of PDF opened in another browser tab

Navigate to a webpage and click on a link (using a locator) which opens the PDF in the same or another browser tab. Then use that PDF URL to parse and verify the content, as seen in the example below:

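import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.Assert;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

import io.github.bonigarcia.wdm.WebDriverManager;

public class PDFInBrowser {
    WebDriver driver;
    String url = "http://www.princexml.com/samples/";
    String pdfUrl;

    @BeforeTest
    public void setup() {
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get(url);
        driver.findElement(By.xpath("(//a[contains(@href,'drylab.pdf')])[2]")).click();
        // Reads the URL of the current tab; if the PDF opens in a new tab,
        // the driver must be switched to that tab before this call.
        pdfUrl = driver.getCurrentUrl();
    }

    // An IOException fails the test instead of being swallowed in a catch block
    @Test
    public void verifyTextFromPDF() throws IOException {
        String pdfContent = getPdfContent(pdfUrl);
        Assert.assertTrue(pdfContent.contains("New York, St. Louis, San Francisco"));
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        InputStream is = pdfURL.openStream();
        BufferedInputStream bis = new BufferedInputStream(is);
        PDDocument doc = PDDocument.load(bis);
        PDFTextStripper strip = new PDFTextStripper();
        String stripText = strip.getText(doc);
        System.out.println(stripText);
        doc.close();
        return stripText;
    }
}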

How to validate contents of an already downloaded PDF in the Downloads folder

Download any PDF file to be tested. Go to the folder where the PDF file is present. Right-click on the PDF and select “Open With” any browser. Copy the entire URL and use it in the test.

For example, to verify the contents of a PDF stored in a local folder, the URL looks like the one below

file:///C:/Users/<username>/Downloads/google_infrastructure_whitepaper_fa.pdf
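Instead of copying the address from the browser, the file: URL can also be built from a file system path in code. The following is a minimal sketch using the standard java.nio.file API. LocalPdfUrl, toFileUrl, and inDownloads are hypothetical names, and the Downloads location under the user's home directory is an assumption that varies by machine and browser settings.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds file: URLs for locally stored PDFs.
class LocalPdfUrl {

    // Converts a local path into a file: URL that URL.openStream() can read.
    static URL toFileUrl(Path pdfPath) throws MalformedURLException {
        return pdfPath.toAbsolutePath().toUri().toURL();
    }

    // Assumption: downloads land in <user.home>/Downloads.
    static Path inDownloads(String fileName) {
        return Paths.get(System.getProperty("user.home"), "Downloads", fileName);
    }
}
```

The string form of the result, for example LocalPdfUrl.toFileUrl(LocalPdfUrl.inDownloads("google_infrastructure_whitepaper_fa.pdf")).toString(), can be passed to a method that expects the PDF URL as a String, which avoids hard-coding a user name in the test.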

Code to verify the contents of the PDF at a given folder location

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.testng.Assert;
import org.testng.annotations.Test;

public class PDFDownload {
    String url = "file:///C:/Users/lenovo/Downloads/google_infrastructure_whitepaper_fa.pdf";

    // An IOException fails the test instead of being swallowed in a catch block
    @Test
    public void verifyTextFromPDF() throws IOException {
        String pdfContent = getPdfContent(url);
        Assert.assertTrue(pdfContent.contains("Secure low-level infrastructure"));
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        InputStream is = pdfURL.openStream();
        BufferedInputStream bis = new BufferedInputStream(is);
        PDDocument doc = PDDocument.load(bis);
        PDFTextStripper strip = new PDFTextStripper();
        String stripText = strip.getText(doc);
        System.out.println(stripText);
        doc.close();
        return stripText;
    }
}

As per the PDFTextStripper class documentation, this class will take a PDF document, strip out all of the text, and ignore the formatting. Please note that it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document. This means you have to make sure that you are using a valid PDF file with proper permission to access it. Also, it cannot parse any image that is scanned into a PDF.

As always, it is important to run Selenium tests on real browsers and devices. BrowserStack offers 3000+ real browsers and devices for testing purposes. Simply sign up, select the required device-browser-OS combination, and start testing websites for free.
