Search for Text in PDF

The PDF to Text Converter component allows you to search for text in PDF documents and retrieve the page number and bounding rectangle of each match. This component is distributed as part of the Winnovative.Pdf.Next.PdfProcessor.Windows NuGet package when targeting Windows and as part of the Winnovative.Pdf.Next.PdfProcessor.Linux package when targeting Linux. The package for Windows is referenced by the Winnovative.Pdf.Next.Windows meta package and the package for Linux is referenced by the Winnovative.Pdf.Next.Linux meta package.

You can set whether the search is case-sensitive and whether it should match whole words only. You can also set the page range in the PDF document where the text is searched. Text search in password-protected PDF documents is also supported, and you can provide both user and owner passwords.

Overview

The Winnovative.Pdf.Next.PdfToTextConverter class allows you to load a PDF file and search for text in its pages.

Create the PDF to Text Converter

The Winnovative.Pdf.Next.PdfToTextConverter class is used to search for text in PDF documents. You can create an instance using the default constructor, which initializes the converter with standard settings. These settings can later be customized through properties like PdfToTextConverter.UserPassword, PdfToTextConverter.OwnerPassword and others, which control the text search process.

Create a PDF to Text Converter Instance
// Create a new PDF to Text converter instance
PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

Note that PdfToTextConverter instances are not reusable. You must create a new instance for each search. Reusing an instance after a completed search will result in an exception.
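One way to respect the one-search-per-instance rule is to hide the converter behind a small helper that constructs a new instance on every call. The sketch below assumes the Winnovative.Pdf.Next package is referenced; the PdfTextSearch class and its Find method are hypothetical names introduced only for illustration.

```csharp
using Winnovative.Pdf.Next;

public static class PdfTextSearch
{
    // Hypothetical helper: builds a fresh converter for every search,
    // because a PdfToTextConverter instance must not be reused
    public static FindTextLocation[] Find(byte[] pdfBytes, string textToFind)
    {
        PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

        // Case-insensitive search that also matches text inside longer words
        return pdfToTextConverter.FindText(pdfBytes, textToFind, false, false);
    }
}
```

Calling PdfTextSearch.Find twice with different search strings then uses two separate converter instances, so the second call does not touch a converter that already completed a search.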

Open Password Protected PDFs

If the PDF document to search is password-protected, you must specify the user or owner password used to decrypt the document before searching. Set the user password in the PdfToTextConverter.UserPassword property and the owner password in the PdfToTextConverter.OwnerPassword property.

Set User and Owner Passwords
pdfToTextConverter.UserPassword = userPasswordString;
pdfToTextConverter.OwnerPassword = ownerPasswordString;

Searched Page Range Limit

The PdfToTextConverter.MaxPageCount property controls the upper limit for the number of PDF pages to search. The default is 0, which means there is no upper limit. The PDF page range to search can be set in the find text methods.

Unlimited Searched PDF Page Range
pdfToTextConverter.MaxPageCount = 0;

Limit the Searched PDF Page Range to 10 Pages
pdfToTextConverter.MaxPageCount = 10;

Find Text in PDF

To search all pages in a PDF document from a memory buffer, use the PdfToTextConverter.FindText(Byte[], String, Boolean, Boolean) method. The first parameter is the PDF document read into a memory buffer, the second parameter is the text to find, the third parameter is a flag indicating whether the search should be case-sensitive, and the fourth parameter is a flag indicating whether the search should match whole words only.

The method returns an array of Winnovative.Pdf.Next.FindTextLocation objects representing the locations where the specified text was found in the PDF document. Each FindTextLocation object contains the page number in FindTextLocation.PageNumber and the bounding rectangle of the match in FindTextLocation.X, FindTextLocation.Y, FindTextLocation.Width and FindTextLocation.Height, with coordinates relative to the top left corner of the PDF page. The locations are returned in the order they are found by a top to bottom, left to right search of the document.

Find Text in All Pages in a PDF from Memory
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfBytes, textToFindString, caseSensitive, wholeWord);
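To make the shape of the returned data concrete, the following sketch walks the result array and prints the page number and bounding rectangle of each match. It uses only the FindTextLocation properties named above; the FindTextReport class and its Print method are hypothetical, and the coordinate values are printed as-is without assuming a particular unit.

```csharp
using System;
using Winnovative.Pdf.Next;

public static class FindTextReport
{
    // Hypothetical helper: prints one line per match, in the order
    // the locations were returned by the search
    public static void Print(FindTextLocation[] findTextLocations)
    {
        Console.WriteLine($"Matches found: {findTextLocations.Length}");

        foreach (FindTextLocation location in findTextLocations)
        {
            // X and Y are relative to the top left corner of the page
            Console.WriteLine(
                $"Page {location.PageNumber}: X={location.X}, Y={location.Y}, " +
                $"Width={location.Width}, Height={location.Height}");
        }
    }
}
```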

To search a PDF document from a memory buffer starting at the given page number through the end of the document, use the PdfToTextConverter.FindText(Byte[], String, Int32, Boolean, Boolean) method. The first parameter is the PDF document read into a memory buffer, the second parameter is the text to find, the third parameter is the 1-based start page number, the fourth parameter is a flag indicating whether the search should be case-sensitive, and the fifth parameter is a flag indicating whether the search should match whole words only.

Search Pages in a PDF from Memory Starting at a Given Page Number
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfBytes, textToFindString, startPageNumber, caseSensitive, wholeWord);

To search a PDF document from a memory buffer starting at the given page number up to the end page number inclusive, use the PdfToTextConverter.FindText(Byte[], String, Int32, Int32, Boolean, Boolean) method. The first parameter is the PDF document read into a memory buffer, the second parameter is the text to find, the third parameter is the 1-based start page number, the fourth parameter is the 1-based end page number, the fifth parameter is a flag indicating whether the search should be case-sensitive, and the sixth parameter is a flag indicating whether the search should match whole words only. If the end page number is 0, the search continues to the end of the document.

Search a Range of Pages in a PDF from Memory
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfBytes, textToFindString, startPageNumber, endPageNumber, caseSensitive, wholeWord);
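After a range search it is often useful to see how the matches are distributed over the searched pages. The sketch below counts matches per page from a sequence of page numbers; it is plain C# with no dependency on the library, and the MatchSummary class and CountPerPage method are hypothetical names.

```csharp
using System.Collections.Generic;

public static class MatchSummary
{
    // Hypothetical helper: returns the number of matches for each page,
    // keyed by 1-based page number and sorted by page
    public static SortedDictionary<int, int> CountPerPage(IEnumerable<int> pageNumbers)
    {
        var countsByPage = new SortedDictionary<int, int>();

        foreach (int pageNumber in pageNumbers)
        {
            // TryGetValue leaves count at 0 when the page was not seen yet
            countsByPage.TryGetValue(pageNumber, out int count);
            countsByPage[pageNumber] = count + 1;
        }

        return countsByPage;
    }
}
```

With System.Linq imported, the page numbers could be supplied as findTextLocations.Select(location => location.PageNumber), assuming FindTextLocation.PageNumber is an Int32 value.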

There are also similar methods to search a PDF for text that accept a PDF stream or a PDF file path.

Search All Pages in a PDF from a Stream or File
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfStream, textToFindString, caseSensitive, wholeWord);
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfFile, textToFindString, caseSensitive, wholeWord);

Search Pages in a PDF from a Stream or File Starting at a Given Page Number
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfStream, textToFindString, startPageNumber, caseSensitive, wholeWord);
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfFile, textToFindString, startPageNumber, caseSensitive, wholeWord);

Search a Range of Pages in a PDF from a Stream or File
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfStream, textToFindString, startPageNumber, endPageNumber, caseSensitive, wholeWord);
FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfFile, textToFindString, startPageNumber, endPageNumber, caseSensitive, wholeWord);
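As a sketch of the stream-based variant, the example below opens a PDF file as a read-only stream and searches all of its pages. It assumes the stream overload accepts any System.IO.Stream, which the snippets above suggest but do not state; the PdfFileSearch class and FindInFile method are hypothetical names.

```csharp
using System.IO;
using Winnovative.Pdf.Next;

public static class PdfFileSearch
{
    // Hypothetical helper: searches all pages of a PDF read through a stream
    public static FindTextLocation[] FindInFile(string pdfFilePath, string textToFind)
    {
        // The stream is disposed when the method returns
        using FileStream inputPdfStream = File.OpenRead(pdfFilePath);

        PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

        // Case-sensitive search matching whole words only
        return pdfToTextConverter.FindText(inputPdfStream, textToFind, true, true);
    }
}
```

When the PDF is already on disk, passing the file path directly to the file path overload avoids managing the stream at all; the stream form is mainly useful for uploads or other sources that are not files.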

Asynchronous Methods to Find Text in PDF

There are also asynchronous variants of these methods that follow the Task-based Asynchronous Pattern (TAP) in .NET, allowing text searches in PDF documents to run asynchronously using async and await. These methods have the same names as their synchronous counterparts with the "Async" suffix appended. They also accept an optional System.Threading.CancellationToken parameter that can be used to cancel the search operation where applicable.

To search all pages in a PDF document from a memory buffer, use the PdfToTextConverter.FindTextAsync(Byte[], String, Boolean, Boolean, CancellationToken) method. The first parameter is the PDF document read into a memory buffer, the second parameter is the text to find, the third parameter is a flag indicating whether the search should be case-sensitive, and the fourth parameter is a flag indicating whether the search should match whole words only. The optional last parameter is the cancellation token.

Asynchronously Find Text in All Pages in a PDF from Memory
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfBytes, textToFindString, caseSensitive, wholeWord);
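The optional cancellation token can be used, for example, to put a time limit on a search. The sketch below passes the token as the last argument, as in the signature above, and assumes FindTextAsync returns a Task of FindTextLocation[]; how a cancelled search is reported (typically an OperationCanceledException in TAP code) is also an assumption here. The PdfTextSearchAsync class and FindWithTimeoutAsync method are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Winnovative.Pdf.Next;

public static class PdfTextSearchAsync
{
    // Hypothetical helper: searches all pages and requests cancellation
    // if the search runs longer than the given timeout
    public static async Task<FindTextLocation[]> FindWithTimeoutAsync(
        byte[] pdfBytes, string textToFind, TimeSpan timeout)
    {
        // The token source signals cancellation once the timeout elapses
        using var cancellationSource = new CancellationTokenSource(timeout);

        PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

        // Case-insensitive search that also matches text inside longer words
        return await pdfToTextConverter.FindTextAsync(
            pdfBytes, textToFind, false, false, cancellationSource.Token);
    }
}
```

A caller might use it as await PdfTextSearchAsync.FindWithTimeoutAsync(pdfBytes, "invoice", TimeSpan.FromSeconds(30)) and handle the cancellation exception at that level.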

To search a PDF document from a memory buffer starting at the given page number through the end of the document, use the PdfToTextConverter.FindTextAsync(Byte[], String, Int32, Boolean, Boolean, CancellationToken) method. The first parameter is the PDF document read into a memory buffer, the second parameter is the text to find, the third parameter is the 1-based start page number, the fourth parameter is a flag indicating whether the search should be case-sensitive, and the fifth parameter is a flag indicating whether the search should match whole words only. The optional last parameter is the cancellation token.

Asynchronously Search Pages in a PDF from Memory Starting at a Given Page Number
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfBytes, textToFindString, startPageNumber, caseSensitive, wholeWord);

To search a PDF document from a memory buffer starting at the given page number up to the end page number inclusive, use the PdfToTextConverter.FindTextAsync(Byte[], String, Int32, Int32, Boolean, Boolean, CancellationToken) method. The first parameter is the PDF document read into a memory buffer, the second parameter is the text to find, the third parameter is the 1-based start page number, the fourth parameter is the 1-based end page number, the fifth parameter is a flag indicating whether the search should be case-sensitive, and the sixth parameter is a flag indicating whether the search should match whole words only. If the end page number is 0, the search continues to the end of the document. The optional last parameter is the cancellation token.

Asynchronously Search a Range of Pages in a PDF from Memory
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfBytes, textToFindString, startPageNumber, endPageNumber, caseSensitive, wholeWord);

There are also similar methods to search a PDF for text that accept a PDF stream or a PDF file path.

Asynchronously Search All Pages in a PDF from a Stream or File
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfStream, textToFindString, caseSensitive, wholeWord);
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfFile, textToFindString, caseSensitive, wholeWord);

Asynchronously Search Pages in a PDF from a Stream or File Starting at a Given Page Number
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfStream, textToFindString, startPageNumber, caseSensitive, wholeWord);
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfFile, textToFindString, startPageNumber, caseSensitive, wholeWord);

Asynchronously Search a Range of Pages in a PDF from a Stream or File
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfStream, textToFindString, startPageNumber, endPageNumber, caseSensitive, wholeWord);
FindTextLocation[] findTextLocations = await pdfToTextConverter.FindTextAsync(inputPdfFile, textToFindString, startPageNumber, endPageNumber, caseSensitive, wholeWord);

Conversion Info

The PdfToTextConverter.ConversionInfo property exposes an object of type Winnovative.Pdf.Next.PdfToTextConversionInfo, which is populated after the search completes successfully with information about the operation, such as the number of pages searched.

Get the Number of PDF Pages Searched for Text
int numberOfPagesSearched = pdfToTextConverter.ConversionInfo.PageCount;
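Because the conversion info is populated only after a search completes, it has to be read after the FindText call returns. A minimal sketch combining the match count with the number of pages searched follows; the SearchStats class and SearchAndReport method are hypothetical names.

```csharp
using System;
using Winnovative.Pdf.Next;

public static class SearchStats
{
    // Hypothetical helper: runs a search and prints a one-line summary
    public static void SearchAndReport(byte[] pdfBytes, string textToFind)
    {
        PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

        FindTextLocation[] findTextLocations =
            pdfToTextConverter.FindText(pdfBytes, textToFind, false, false);

        // Read ConversionInfo only after the search has completed
        int numberOfPagesSearched = pdfToTextConverter.ConversionInfo.PageCount;

        Console.WriteLine(
            $"{findTextLocations.Length} matches in {numberOfPagesSearched} pages searched");
    }
}
```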

Code Sample - Find Text in PDF

Find Text in PDF in ASP.NET Core
using System;
using System.IO;
using System.Threading.Tasks;
using System.ComponentModel.DataAnnotations;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Winnovative_Next_AspNetDemo.Models;
using Winnovative_Next_AspNetDemo.Models.PDF_to_Text;

// Use Winnovative Namespace
using Winnovative.Pdf.Next;

namespace Winnovative_Next_AspNetDemo.Controllers.PDF_to_Text
{
    public class Find_PDF_TextController : Controller
    {
        private readonly IWebHostEnvironment m_hostingEnvironment;
        public Find_PDF_TextController(IWebHostEnvironment hostingEnvironment)
        {
            m_hostingEnvironment = hostingEnvironment;
        }

        public IActionResult Index()
        {
            var model = SetViewModel();

            return View(model);
        }

        [HttpPost]
        public async Task<IActionResult> FindPdfText(Find_PDF_Text_ViewModel model)
        {
            if (!ModelState.IsValid)
            {
                var errorMessage = ModelStateHelper.GetModelErrors(ModelState);
                throw new ValidationException(errorMessage);
            }

            // Set license key received after purchase to use the converter in licensed mode
            // Leave it not set to use the library in demo mode
            Licensing.LicenseKey = "3FJDU0ZDU0NTQkddQ1NAQl1CQV1KSkpKU0M=";

            // Create the PDF to Text converter instance with default options
            PdfToTextConverter pdfToTextConverter = new PdfToTextConverter();

            // Optionally set the user password to open a password-protected PDF
            if (!string.IsNullOrEmpty(model.UserPassword))
                pdfToTextConverter.UserPassword = model.UserPassword;

            // Optionally set the owner password to open a password-protected PDF
            if (!string.IsNullOrEmpty(model.OwnerPassword))
                pdfToTextConverter.OwnerPassword = model.OwnerPassword;

            // PDF page number to start text search from
            int startPageNumber = model.StartPageNumber;

            // PDF page number to end text search at
            // If 0, search continues to the end of the document
            int endPageNumber = 0;
            if (model.EndPageNumber.HasValue)
                endPageNumber = model.EndPageNumber.Value;

            byte[] inputPdfBytes = null;
            string outputFileName = null;

            // If an uploaded file exists, use it with priority
            if (model.PdfFile != null && model.PdfFile.Length > 0)
            {
                try
                {
                    using var ms = new MemoryStream();
                    await model.PdfFile.CopyToAsync(ms);
                    inputPdfBytes = ms.ToArray();
                }
                catch (Exception ex)
                {
                    throw new Exception("Failed to read the uploaded PDF file", ex);
                }

                outputFileName = Path.GetFileNameWithoutExtension(model.PdfFile.FileName) + "_Highlighted.pdf";
            }
            else
            {
                // Otherwise, fall back to the URL
                string pdfUrl = model.PdfFileUrl?.Trim();
                if (string.IsNullOrWhiteSpace(pdfUrl))
                    throw new Exception("No PDF file provided: upload a file or specify a URL");

                try
                {
                    if (pdfUrl.StartsWith("file://", StringComparison.OrdinalIgnoreCase))
                    {
                        string localPath = new Uri(pdfUrl).LocalPath;
                        inputPdfBytes = await System.IO.File.ReadAllBytesAsync(localPath);
                    }
                    else
                    {
                        using var httpClient = new System.Net.Http.HttpClient();
                        inputPdfBytes = await httpClient.GetByteArrayAsync(pdfUrl);
                    }
                }
                catch (Exception ex)
                {
                    throw new Exception("Could not download the PDF file from URL", ex);
                }

                // Use the trimmed URL that was actually downloaded
                outputFileName = Path.GetFileNameWithoutExtension(pdfUrl) + "_Highlighted.pdf";
            }

            // Search text in PDF
            FindTextLocation[] findTextLocations = pdfToTextConverter.FindText(inputPdfBytes, model.TextToFind,
                        startPageNumber, endPageNumber, model.CaseSensitive, model.WholeWord);

            // Open the PDF in editor
            string password = string.IsNullOrEmpty(model.OwnerPassword) ? model.UserPassword : model.OwnerPassword;
            using PdfEditor pdfEditor = new PdfEditor(inputPdfBytes, password);

            // Highlight the found text in PDF
            foreach (FindTextLocation findTextLocation in findTextLocations)
            {
                PdfRectangleElement highlightRectangle = new PdfRectangleElement(findTextLocation.X, findTextLocation.Y,
                    findTextLocation.Width, findTextLocation.Height);
                highlightRectangle.BorderColor = PdfColor.Yellow;

                pdfEditor.AddRectangle(findTextLocation.PageNumber, highlightRectangle);
            }

            // Save the highlighted PDF in a memory buffer
            byte[] outPdfBuffer = pdfEditor.Save();

            // Return the highlighted PDF as a downloadable file
            return File(outPdfBuffer, "application/pdf", outputFileName);
        }

        private Find_PDF_Text_ViewModel SetViewModel()
        {
            var model = new Find_PDF_Text_ViewModel();

            HttpRequest request = ControllerContext.HttpContext.Request;
            UriBuilder uriBuilder = new UriBuilder();
            uriBuilder.Scheme = request.Scheme;
            uriBuilder.Host = request.Host.Host;
            if (request.Host.Port != null)
                uriBuilder.Port = (int)request.Host.Port;
            uriBuilder.Path = request.PathBase.ToString() + request.Path.ToString();
            uriBuilder.Query = request.QueryString.ToString();

            string currentPageUrl = uriBuilder.Uri.AbsoluteUri;
            string rootUrl = currentPageUrl.Substring(0, currentPageUrl.Length - "Find_PDF_Text".Length);

            model.PdfFileUrl = rootUrl + "/DemoAppFiles/Input/PdfProcessor_Files/PDF_Document.pdf";

            return model;
        }
    }
}

See Also