How to Extract Content from PDF Documents

Currently, the PDF Viewer provides two PDF document content extraction techniques:

  • Selecting and copying specific content to the clipboard;

  • Extracting the required content directly from the loaded PDF document’s internal representation.

The direct approach described in this topic relies only on the API provided by the PDF Viewer. In contrast, the technique discussed in the How to Select and Copy the PDF Document’s Content topic is available both to end-users and programmatically.

You can obtain text and/or images from the PDF document loaded to:

  • The TdxPDFViewer control (that is, a visual component);

  • An independent PDF document container allowing you to work with the document without displaying it in your application.

The PDF document container provides zero-based access to information on individual pages and their content via the PageInfo property. You can use the page’s Text, Images, and Hyperlinks fields to obtain text, images, and external hyperlinks if the content select and copy operations are allowed (that is, the document’s AllowContentExtraction property returns True).
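As a minimal sketch of the independent container scenario, the console program below loads a file into a TdxPDFDocument instance and reports the number of images on each page. The dxPDFDocument unit name, the parameterless constructor, the LoadFromFile method, and the 'C:\Docs\Demo.pdf' path are assumptions made for illustration. The PageCount, PageInfo, Images.Count, and AllowContentExtraction members are the ones this topic describes.

```pascal
program ListPdfImages;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  dxPDFDocument;  // Assumed name of the unit that declares TdxPDFDocument

var
  ADocument: TdxPDFDocument;
  I: Integer;
begin
  ADocument := TdxPDFDocument.Create;  // Assumed parameterless constructor
  try
    ADocument.LoadFromFile('C:\Docs\Demo.pdf');  // Hypothetical path; LoadFromFile is assumed
    if not ADocument.AllowContentExtraction then
    begin
      Writeln('Content extraction is not allowed for this document.');
      Exit;  // The finally section still frees the container
    end;
    for I := 0 to ADocument.PageCount - 1 do  // PageInfo is zero-based
      Writeln(Format('Page %d: %d image(s)', [I, ADocument.PageInfo[I].Images.Count]));
  finally
    ADocument.Free;
  end;
end.
```

No visual component is involved here, so the same approach should fit services and batch tools that never display the document.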

You can use the page’s Text field to extract all its text lines as a single string. Individual text lines within the string are delimited with line break characters (that is, #13#10 in Delphi or \n in C++Builder). The Text field returns an empty string instead of the page’s content if the select and copy operations are forbidden for the loaded document.
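The following sketch shows one way to gather the text of every page into a string list. ExtractDocumentText is a hypothetical helper name, and the dxPDFDocument unit name is an assumption. The code relies on the standard TStrings.Text setter to split each page's text at its line break characters.

```pascal
// Requires System.SysUtils and System.Classes in the uses clause,
// plus the unit that declares TdxPDFDocument (assumed to be dxPDFDocument).

// Hypothetical helper: fills ALines with the text lines of all document pages.
procedure ExtractDocumentText(ADocument: TdxPDFDocument; ALines: TStrings);
var
  APageLines: TStringList;
  I: Integer;
begin
  ALines.Clear;
  if not ADocument.AllowContentExtraction then Exit;  // Text would be empty anyway
  APageLines := TStringList.Create;
  try
    for I := 0 to ADocument.PageCount - 1 do
    begin
      APageLines.Text := ADocument.PageInfo[I].Text;  // Splits the page text into lines
      ALines.AddStrings(APageLines);  // Appends the page's lines to the result
    end;
  finally
    APageLines.Free;
  end;
end;
```

For instance, a call such as ExtractDocumentText(dxPDFViewer1.Document, Memo1.Lines) would display the extracted text in a memo, where Memo1 is a hypothetical TMemo placed on the same form.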

Each image on a page is accessible via the Images field. You can obtain an individual image as a bitmap by reading its Bitmap property, provided that two conditions are met: the content select and copy operations are allowed for the document (that is, the Images field provides access to a valid image collection), and the inspected page contains at least one image (that is, the Images.Count property value is positive).

To extract all the images from the loaded PDF file, you need to obtain all PDF image bitmaps from each document page within two nested loops. Additional operations related to saving the resulting images to bitmap files may include:

  • Identification of the source PDF file name (which is convenient to use as the prefix for generating unique bitmap file names) by using the document’s Information.FileName property;

  • Generating the destination folder name by appending the file prefix and the backslash character to the source folder;

  • Creating the destination folder if it does not yet exist;

  • Converting the extracted images to the required bitmap format and saving them as files with unique names.

The following code example saves all images from all document pages as a series of PNG files with unique names generated from the source file name followed by the corresponding page and image indexes:

// Requires System.SysUtils (DirectoryExists, CreateDir, IntToStr) and System.IOUtils (TPath) in the uses clause
var
  ABitmap: TBitmap;
  AImage: TdxSmartImage;  // A TdxSmartImage object is used to store, convert, and save a single extracted document image
  ADocument: TdxPDFDocument;  // This variable is used to access the loaded document
  AFolder, APrefix: string;
  I, J: Integer;  // Loop counters
begin
  if (not dxPDFViewer1.IsDocumentLoaded or
    not dxPDFViewer1.Document.AllowContentExtraction) then Exit;  // Exits the image extraction routine if the PDF Viewer control has no loaded document or the content select and copy operations are forbidden
  ADocument := dxPDFViewer1.Document;
  APrefix := TPath.GetFileNameWithoutExtension(ADocument.Information.FileName);  // Extracts the source file name for use as the result file prefix
  AFolder := TPath.Combine(TPath.GetDirectoryName(ADocument.Information.FileName), APrefix) + '\';  // Joins the source folder path and the prefix with a path delimiter to build the destination folder name
  if not DirectoryExists(AFolder) then CreateDir(AFolder);  // Creates the destination folder if it does not yet exist
  AImage := TdxSmartImage.Create;  // Creates a new TdxSmartImage container
  try
    for I := 0 to ADocument.PageCount - 1 do  // Iterates through all pages of the loaded document
      for J := 0 to ADocument.PageInfo[I].Images.Count - 1 do  // Iterates through all images on a single document page
      begin
        ABitmap := ADocument.PageInfo[I].Images[J].Bitmap;  // Obtains a document image as a bitmap
        AImage.CreateFromBitmap(ABitmap);  // Saves an obtained image to the TdxSmartImage container
        AImage.ImageDataFormat := dxImagePng;  // Changes the stored bitmap's format to PNG
        AImage.SaveToFile(AFolder + APrefix + '_' + IntToStr(I) + '_' + IntToStr(J) + '.png');  // Saves the resulting bitmap as a PNG file with a unique generated name
      end;
  finally
    AImage.Free;  // Frees the TdxSmartImage container even if an exception occurs, to prevent memory leaks
  end;
end;

The name of each saved PNG file starts with the original PDF file name (without the extension) as a prefix, followed by the zero-based source page and image indexes. For instance, if the source file is named “Demo.pdf”, the images on the first document page are saved as “Demo_0_0.png”, “Demo_0_1.png”, and so on, in the Demo subfolder of the folder that contains the source file.

The page’s Hyperlinks field provides zero-based indexed access to individual hyperlinks on a page in the same manner as the Images field provides access to the image collection. Each hyperlink has the Hint property that you can use to obtain the hint that the PDF Viewer control displays if its OptionsBehavior.ShowHint property is set to True. The Hint property is automatically initialized to the external hyperlink’s uniform resource identifier (URI). Since internal hyperlinks within a PDF document have no URI, this property returns an empty string for them.
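As an illustration, the sketch below collects the URIs of all external hyperlinks in a document by reading each hyperlink's Hint property and skipping empty values. CollectExternalLinks is a hypothetical helper name. The Hyperlinks.Count property and the string type of Hint are assumptions made by analogy with the Images collection, not members confirmed by this topic.

```pascal
// Requires System.Classes in the uses clause, plus the unit that
// declares TdxPDFDocument (assumed to be dxPDFDocument).

// Hypothetical helper: fills ALinks with the URIs of external hyperlinks.
procedure CollectExternalLinks(ADocument: TdxPDFDocument; ALinks: TStrings);
var
  AHint: string;
  I, J: Integer;
begin
  ALinks.Clear;
  if not ADocument.AllowContentExtraction then Exit;
  for I := 0 to ADocument.PageCount - 1 do  // Iterates through all pages
    for J := 0 to ADocument.PageInfo[I].Hyperlinks.Count - 1 do  // Count is assumed by analogy with Images.Count
    begin
      AHint := ADocument.PageInfo[I].Hyperlinks[J].Hint;
      if AHint <> '' then  // Internal hyperlinks return an empty hint and are skipped
        ALinks.Add(AHint);
    end;
end;
```

Filtering on an empty hint follows the behavior described above, so the resulting list contains external links only.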