How To Validate Text in PDF Files Using Selenium Automation
April 04, 2026 · 9 min read · Tool Comparison
Portable Document Format (PDF) is a file format developed by Adobe in 1992 to present documents, including text formatting and images. Verifying content in PDF files is a major requirement in web applications, particularly where contracts, reports, or documentation are involved. Since Selenium does not natively support reading PDF content, Apache PDFBox is used to extract and validate text from PDFs, both hosted online and downloaded locally.

This article covers why PDF content needs validating, the tools and setup, extracting text from a PDF, and validations and assertions. It will help you learn how to set up, read, and assert PDF content with Selenium and PDFBox.

Almost every organization or business uses PDF files to store official data. Take a very simple use case: most sites have some links which, when clicked, either open the PDF in the browser's reader mode or download the PDF to the local system, depending on the browser's settings for handling PDF files.

You can test these PDF files manually by opening the link, or by opening the PDF file from the local system, and verifying whether particular information is available. However, verifying the content of PDF files at scale becomes cumbersome; hence, automation is a must.

Selenium does not have any built-in functionality to test the content of PDF files, so it needs a third-party library, Apache PDFBox. It is an open-source Java tool and can be used with Selenium Java and TestNG to assert the content of a PDF. Apache PDFBox permits the creation of new PDF documents, manipulation of existing documents, and extraction of content from documents. This article explores content extraction from PDFs with Selenium using Apache PDFBox.

The Apache PDFBox library can be downloaded and added as an external library in Eclipse or any other editor of your choice. It can also be added as a Maven dependency in pom.xml.

Downloading the jar and adding it as an external jar:
Note: 2.0.26 is the version used in this article.
In the future, you may navigate to https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/ and download the latest version. Keep in mind that the code in this article uses the PDFBox 2.x API (PDDocument.load); PDFBox 3.x loads documents through org.apache.pdfbox.Loader.loadPDF instead.

Adding as a Maven dependency:
Step 1: Create a Maven project in Eclipse or any Java editor by selecting the archetype "maven-archetype-quickstart" and add the Selenium Java and TestNG dependencies in pom.xml as seen below.
Step 2: Copy the latest PDFBox dependency from https://mvnrepository.com/artifact/org.apache.pdfbox/pdfbox and add it under the <dependencies> tag in pom.xml.
Step 3: Save the pom.xml file to download all the dependencies from the Maven repository (make sure to have an active internet connection while doing this). To be doubly sure, expand the Maven Dependencies folder and verify that the required jars are downloaded.

Now that the jars are configured, let us start using Java and TestNG to extract content from a PDF file. The PDFTextStripper class of the PDFBox library is used to extract (strip out) the text from a PDF file, as seen in the command below.

To get the "doc" reference that is passed to the getText method, you need to write the lines of code below, which we will walk through step by step. The getText() method of PDFTextStripper gets the text of the document passed as a parameter and returns a String value.

Directly navigate to the desired PDF file hosted on the web using a link and verify the content, as seen in the example explained in the previous section. An example ReadPDF test class, combining the above steps in a single code snippet, is seen below.

Now that we have received all the content from the PDF file, which is stored in a String object pdfContent, let us see how to assert whether the expected text is present in the pdfContent String. You can use TestNG assertions like the one below to assert that a given text is present in the PDF.
You can make multiple assertions as per your requirement. You may also use TestNG SoftAssert, as it does not throw an exception when the first assertion fails (in the case of multiple asserts); instead it records all failures and throws at the end, when the assertAll() method of the SoftAssert class is called.

PDFBox can parse large PDF files as well. For example, if you are testing a PDF file of 40 pages and you are interested in parsing only a limited set of pages, you can achieve that with the setStartPage(int startPageValue) and setEndPage(int endPageValue) methods of the PDFTextStripper class. Here x and y are the starting and ending page numbers of the PDF pages that you want to extract.

Code to read a PDF from a given start page to the end page:

Navigate to a webpage and click on a link which opens the PDF in the same or another browser tab. Then use that PDF URL to parse and verify the content, as seen in the example below.

Download any PDF file to be tested. Go to the folder where the PDF file is present. Right-click on the PDF and select "Open With" any browser. Copy the entire URL and use it. For example, verifying the contents of a PDF stored in the folder location mentioned below.

Code to verify the contents of the PDF at a given folder location:

As per the PDFTextStripper class documentation, this class will take a PDF document, strip out all of the text, and ignore the formatting. Please note that it is up to clients of this class to verify that a specific user has the correct permissions to extract text from the PDF document. This means you have to make sure that you are using a valid PDF file with proper permission to access it. Also, it cannot parse any image file that is scanned into a PDF.

As always, it is important to run Selenium tests on real browsers and devices. BrowserStack offers 3000+ real browsers and devices for testing purposes. Simply sign up, select the required device-browser-OS combination, and start testing websites for free.
Overview
Why is verifying PDF file content required?
What is Apache PDFBox?
How to integrate PDFBox with Selenium and Java
<dependencies>
    <dependency>
        <groupId>org.seleniumhq.selenium</groupId>
        <artifactId>selenium-java</artifactId>
        <version>4.3.0</version>
    </dependency>
    <dependency>
        <groupId>io.github.bonigarcia</groupId>
        <artifactId>webdrivermanager</artifactId>
        <version>5.2.1</version>
    </dependency>
    <dependency>
        <groupId>org.apache.pdfbox</groupId>
        <artifactId>pdfbox</artifactId>
        <version>2.0.26</version>
    </dependency>
    <dependency>
        <groupId>org.testng</groupId>
        <artifactId>testng</artifactId>
        <version>7.6.1</version>
    </dependency>
</dependencies>
How to read content from PDF file using Apache PDFBox
String pdfContent = new PDFTextStripper().getText(doc);
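// URL of the PDF under test
String pdfUrl = "https://cloud.google.com/docs/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf";
URL url = new URL(pdfUrl);
// Open a buffered stream to the PDF
InputStream is = url.openStream();
BufferedInputStream bis = new BufferedInputStream(is);
// Parse the stream into a PDFBox document
PDDocument doc = PDDocument.load(bis);
// Extract all text from the document
String pdfContent = new PDFTextStripper().getText(doc);
// Release the document's resources
doc.close();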
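A hosted link does not always return a PDF: a login page or an error page may come back instead, and the parser then fails with a less obvious error. The following is a minimal sketch, not part of the original article, of a pre-check on the downloaded bytes. It relies on the fact that a well-formed PDF begins with the marker "%PDF-"; the class and method names (PdfHeaderCheck, looksLikePdf) are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: checks whether a payload starts with the PDF file marker.
final class PdfHeaderCheck {
    // A well-formed PDF file begins with "%PDF-" followed by a version number.
    private static final byte[] PDF_MARKER = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the first bytes of the payload match the PDF marker. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null || data.length < PDF_MARKER.length) {
            return false;
        }
        for (int i = 0; i < PDF_MARKER.length; i++) {
            if (data[i] != PDF_MARKER[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] pdf = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        byte[] html = "<html>Sign in</html>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("pdf payload accepted: " + looksLikePdf(pdf));
        System.out.println("html payload accepted: " + looksLikePdf(html));
    }
}
```

One way to use it, assuming Java 9 or later, is to read the stream with InputStream.readAllBytes(), run the check, and only then pass the same byte array to PDDocument.load(byte[]) from PDFBox 2.x.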
How to validate contents of PDF file hosted on the web
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.Assert;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

import io.github.bonigarcia.wdm.WebDriverManager;

public class ReadPDF {
    WebDriver driver;
    String pdfUrl = "https://cloud.google.com/docs/security/infrastructure/design/resources/google_infrastructure_whitepaper_fa.pdf";

    @BeforeTest
    public void setup() {
        // Start Chrome and open the hosted PDF
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get(pdfUrl);
    }

    @Test
    public void verifyTextFromPDF() {
        try {
            String pdfContent = getPdfContent(pdfUrl);
            Assert.assertTrue(pdfContent.contains("Secure low-level infrastructure"));
            Assert.assertTrue(pdfContent.contains("Security of physical premises"));
        } catch (IOException e) {
            // Fail the test instead of silently passing when the PDF cannot be read
            Assert.fail("Could not read PDF: " + e.getMessage());
        }
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        InputStream is = pdfURL.openStream();
        BufferedInputStream bis = new BufferedInputStream(is);
        PDDocument doc = PDDocument.load(bis);
        int pages = doc.getNumberOfPages();
        System.out.println("The total number of pages " + pages);
        // Extract text from pages 1 and 2 only
        PDFTextStripper strip = new PDFTextStripper();
        strip.setStartPage(1);
        strip.setEndPage(2);
        String stripText = strip.getText(doc);
        System.out.println(stripText);
        doc.close();
        return stripText;
    }
}

How to assert PDF text
Assert.assertTrue(pdfContent.contains("Google Infrastructure Security"));
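Extracted text keeps the line breaks of the PDF layout, so a phrase that wraps across two lines will not match a plain contains() check. The sketch below, which is an addition and not part of the original article, collapses whitespace before comparing and reports every missing phrase in one pass, in the same spirit as SoftAssert. The names PdfTextChecks, normalize, and missingPhrases are hypothetical, and only the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for whitespace-tolerant phrase checks on extracted PDF text.
final class PdfTextChecks {
    /** Collapses every run of whitespace (spaces, tabs, line breaks) into one space. */
    static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    /** Returns the expected phrases that do not occur in the extracted text. */
    static List<String> missingPhrases(String pdfContent, List<String> expectedPhrases) {
        String haystack = normalize(pdfContent);
        List<String> missing = new ArrayList<>();
        for (String phrase : expectedPhrases) {
            if (!haystack.contains(normalize(phrase))) {
                missing.add(phrase);
            }
        }
        return missing;
    }
}
```

In a TestNG test the result could be checked with a single call such as Assert.assertTrue(missing.isEmpty(), "Missing from PDF: " + missing), so the failure message lists every phrase that was not found.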
How to set the start and end page of PDF for extraction?
strip.setStartPage(x);
strip.setEndPage(y);
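Page numbers passed to setStartPage and setEndPage are 1-based and inclusive. As an optional guard, and purely as an illustrative sketch with hypothetical names (PageRange, of), the requested range can be checked against the document's page count before extraction, so that a mistyped range is rejected early rather than surfacing later as a failed text assertion.

```java
// Hypothetical value class: a validated, 1-based, inclusive page range.
final class PageRange {
    final int start;
    final int end;

    private PageRange(int start, int end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Validates the requested range against the page count.
     * An end page beyond the last page is clamped to the last page.
     */
    static PageRange of(int start, int end, int pageCount) {
        if (start < 1 || end < start) {
            throw new IllegalArgumentException("Invalid page range: " + start + " to " + end);
        }
        if (start > pageCount) {
            throw new IllegalArgumentException(
                    "Start page " + start + " is beyond the last page " + pageCount);
        }
        return new PageRange(start, Math.min(end, pageCount));
    }
}
```

With PDFBox, the page count would come from doc.getNumberOfPages(), and the validated values would then be passed on as strip.setStartPage(range.start) and strip.setEndPage(range.end).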
public static String getPdfContent(String url) throws IOException {
    URL pdfURL = new URL(url);
    InputStream is = pdfURL.openStream();
    BufferedInputStream bis = new BufferedInputStream(is);
    PDDocument doc = PDDocument.load(bis);
    int pages = doc.getNumberOfPages();
    System.out.println("The total number of pages " + pages);
    // Restrict extraction to pages 1 through 2 (1-based, inclusive)
    PDFTextStripper strip = new PDFTextStripper();
    strip.setStartPage(1);
    strip.setEndPage(2);
    String stripText = strip.getText(doc);
    System.out.println(stripText);
    doc.close();
    return stripText;
}

How to validate contents of PDF opened in another browser tab
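When a link opens the PDF in a new tab, driver.getCurrentUrl() keeps reporting the original tab until the driver switches windows. The sketch below is an addition to the article and uses only the JDK: it picks the handle of the newly opened tab from the set that Selenium's driver.getWindowHandles() returns. The class and method names (WindowHandles, firstOtherThan) are hypothetical.

```java
import java.util.Optional;
import java.util.Set;

// Hypothetical helper: finds a window handle other than the one the test started in.
final class WindowHandles {
    /** Returns the first handle that differs from the original one, if any. */
    static Optional<String> firstOtherThan(Set<String> allHandles, String originalHandle) {
        return allHandles.stream()
                .filter(handle -> !handle.equals(originalHandle))
                .findFirst();
    }
}
```

In a test, the original handle would be captured with driver.getWindowHandle() before the click; after the click, the handle returned here can be passed to driver.switchTo().window(handle) before calling driver.getCurrentUrl(). The new tab may take a moment to appear, so an explicit wait for a second window may be needed first.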
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.testng.Assert;
import org.testng.annotations.BeforeTest;
import org.testng.annotations.Test;

import io.github.bonigarcia.wdm.WebDriverManager;

public class PDFInBrowser {
    WebDriver driver;
    String url = "http://www.princexml.com/samples/";
    String pdfUrl;

    @BeforeTest
    public void setup() {
        WebDriverManager.chromedriver().setup();
        driver = new ChromeDriver();
        driver.manage().window().maximize();
        driver.get(url);
        // Click the link to the sample PDF, then capture the URL it opened
        driver.findElement(By.xpath("(//a[contains(@href,'drylab.pdf')])[2]")).click();
        pdfUrl = driver.getCurrentUrl();
    }

    @Test
    public void verifyTextFromPDF() {
        try {
            String pdfContent = getPdfContent(pdfUrl);
            Assert.assertTrue(pdfContent.contains("New York, St. Louis, San Francisco"));
        } catch (IOException e) {
            // Fail the test instead of silently passing when the PDF cannot be read
            Assert.fail("Could not read PDF: " + e.getMessage());
        }
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        InputStream is = pdfURL.openStream();
        BufferedInputStream bis = new BufferedInputStream(is);
        PDDocument doc = PDDocument.load(bis);
        PDFTextStripper strip = new PDFTextStripper();
        String stripText = strip.getText(doc);
        System.out.println(stripText);
        doc.close();
        return stripText;
    }
}

How to validate contents of already downloaded PDF in the Downloads folder
file:///C:/Users/<username>/Downloads/google_infrastructure_whitepaper_fa.pdf
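Instead of copying the URL from the browser, the same kind of file URL can be derived from a path, which avoids hard-coding a user name. This is a minimal sketch and not part of the original article. It assumes the PDF sits in a "Downloads" folder directly under the user's home directory, which may not hold on every machine, and the names LocalPdfUrl and downloadsFileUrl are hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds a file: URL for a PDF in the user's Downloads folder.
final class LocalPdfUrl {
    /**
     * Assumption: downloads are stored in <user.home>/Downloads.
     * Path.toUri() produces a file: URI with the path correctly escaped.
     */
    static String downloadsFileUrl(String fileName) {
        Path pdf = Paths.get(System.getProperty("user.home"), "Downloads", fileName);
        return pdf.toUri().toString();
    }

    public static void main(String[] args) {
        System.out.println(downloadsFileUrl("google_infrastructure_whitepaper_fa.pdf"));
    }
}
```

The returned string can be passed to a getPdfContent(String url) method in place of a literal file URL.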
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.testng.Assert;
import org.testng.annotations.Test;

public class PDFDownload {
    String url = "file:///C:/Users/lenovo/Downloads/google_infrastructure_whitepaper_fa.pdf";

    @Test
    public void verifyTextFromPDF() {
        try {
            String pdfContent = getPdfContent(url);
            Assert.assertTrue(pdfContent.contains("Secure low-level infrastructure"));
        } catch (IOException e) {
            // Fail the test instead of silently passing when the PDF cannot be read
            Assert.fail("Could not read PDF: " + e.getMessage());
        }
    }

    public static String getPdfContent(String url) throws IOException {
        URL pdfURL = new URL(url);
        InputStream is = pdfURL.openStream();
        BufferedInputStream bis = new BufferedInputStream(is);
        PDDocument doc = PDDocument.load(bis);
        PDFTextStripper strip = new PDFTextStripper();
        String stripText = strip.getText(doc);
        System.out.println(stripText);
        doc.close();
        return stripText;
    }
}