How to: Extract Text from a Document

Jul 13, 2020

Important

You need a license to the DevExpress Office File API or DevExpress Universal Subscription to use these examples in production code. Refer to the DevExpress Subscription page for pricing information.

Extract All Text

This tutorial describes how to extract the text of a PDF file at runtime using the PDF Document API.

To extract the text of a PDF file, do the following:

1. Create a PdfDocumentProcessor instance.
2. To open a PDF file, pass the file path or a stream that contains the document data to the PdfDocumentProcessor.LoadDocument method.
3. After the document is loaded, read its plain text from the PdfDocumentProcessor.Text property.

The following code implements this functionality.

Note

A complete sample project is available at https://github.com/DevExpress-Examples/how-to-operate-a-pdf-content-at-runtime-e5025

MainForm.cs

string ExtractTextFromPDF(string filePath) {
    string documentText = "";
    try {
        using (PdfDocumentProcessor documentProcessor = new PdfDocumentProcessor()) {
            documentProcessor.LoadDocument(filePath);
            documentText = documentProcessor.Text;
        }
    }
    // Any failure (for example, a missing or corrupt file) is swallowed,
    // so the method returns an empty string in that case.
    catch { }
    return documentText;
}

MainForm.vb

Private Function ExtractTextFromPDF(ByVal filePath As String) As String
    Dim documentText As String = ""
    Try
        Using documentProcessor As New PdfDocumentProcessor()
            documentProcessor.LoadDocument(filePath)
            documentText = documentProcessor.Text
        End Using
    Catch
        ' Any failure (for example, a missing or corrupt file) is swallowed,
        ' so the function returns an empty string in that case.
    End Try
    Return documentText
End Function
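
The steps above also mention loading from a stream. The following is a minimal sketch of that variant, assuming the LoadDocument overload that accepts a Stream. The class name PdfTextReader, the method name ExtractTextFromStream, and the file name Document.pdf are hypothetical and used for illustration only.

```csharp
using System;
using System.IO;
using DevExpress.Pdf;

static class PdfTextReader {
    // Hypothetical helper: loads a PDF from any readable stream
    // and returns the document's plain text.
    public static string ExtractTextFromStream(Stream pdfStream) {
        using (PdfDocumentProcessor processor = new PdfDocumentProcessor()) {
            processor.LoadDocument(pdfStream);
            return processor.Text;
        }
    }

    static void Main() {
        // "Document.pdf" is a placeholder path.
        using (FileStream stream = File.OpenRead("Document.pdf")) {
            Console.WriteLine(ExtractTextFromStream(stream));
        }
    }
}
```

Unlike the sample above, this sketch does not catch exceptions, so a missing or invalid file surfaces as an error instead of an empty string.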

Note

The PdfDocumentProcessor.Text property retrieves the content clipped to the crop box. Use the PdfDocumentProcessor.GetText method to get text without clipping. Set the PdfTextExtractionOptions.ClipToCropBox property to false and pass the PdfTextExtractionOptions object as a method parameter.
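
As a rough illustration of the note above, the sketch below reads the text of the whole document without crop box clipping. It assumes the GetText overload that takes only a PdfTextExtractionOptions object, as the note describes. The file name Document.pdf is a placeholder.

```csharp
using System;
using DevExpress.Pdf;

class UnclippedTextExample {
    static void Main() {
        using (PdfDocumentProcessor processor = new PdfDocumentProcessor()) {
            // "Document.pdf" is a placeholder path.
            processor.LoadDocument("Document.pdf");

            // Disable clipping so that text outside the crop box is included.
            PdfTextExtractionOptions options =
                new PdfTextExtractionOptions { ClipToCropBox = false };

            string fullText = processor.GetText(options);
            Console.WriteLine(fullText);
        }
    }
}
```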

Extract Text from a Page

Use the PdfDocumentProcessor.GetPageText method to retrieve text from the specified page. Page numbers start at 1. The method returns the text as a single string in which lines are separated by "\r\n". If the document does not contain the specified page, the GetPageText method returns an empty string.
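
Because the page text arrives as one string, a caller often needs to split it back into lines. The helper below is a minimal sketch of one way to do that in plain C#. PageTextLines and SplitIntoLines are hypothetical names and are not part of the PDF Document API. The helper treats an empty result (for example, a missing page) as zero lines.

```csharp
using System;

static class PageTextLines {
    // Hypothetical helper: splits text returned by GetPageText into lines.
    // An empty or null string yields an empty array.
    public static string[] SplitIntoLines(string pageText) {
        if (string.IsNullOrEmpty(pageText))
            return new string[0];
        // Split on the "\r\n" separator only; empty lines are preserved.
        return pageText.Split(new[] { "\r\n" }, StringSplitOptions.None);
    }
}
```

A typical call would pass the GetPageText result directly, for example PageTextLines.SplitIntoLines(processor.GetPageText(1)).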

The code sample below extracts text from the first page without clipping:

C#

PdfDocumentProcessor pdfDocumentProcessor = new PdfDocumentProcessor();
pdfDocumentProcessor.LoadDocument("PDF32000_2008.pdf");
// Page number 1 refers to the first page.
string firstPageText =
    pdfDocumentProcessor.GetPageText(1, new PdfTextExtractionOptions { ClipToCropBox = false });

VB.NET

Dim pdfDocumentProcessor As PdfDocumentProcessor = New PdfDocumentProcessor()
pdfDocumentProcessor.LoadDocument("PDF32000_2008.pdf")
' Page number 1 refers to the first page.
Dim firstPageText As String =
    pdfDocumentProcessor.GetPageText(1, New PdfTextExtractionOptions With {
        .ClipToCropBox = False
    })

Extract Text from an Area

The PdfDocumentProcessor.GetText method allows you to retrieve text from the specified document area. To define the area, pass either two PdfDocumentPosition objects (the start and end positions) or a single PdfDocumentArea instance.

The GetText method uses the page coordinate system. Refer to the following help topic for more details: Coordinate Systems.

The code sample below extracts text between two positions on the first page:

C#

using (DevExpress.Pdf.PdfDocumentProcessor processor = new DevExpress.Pdf.PdfDocumentProcessor())
{
    processor.LoadDocument("TextExtraction.pdf");
    PdfDocumentPosition startPosition = new PdfDocumentPosition(1, new PdfPoint(0, 0));
    PdfDocumentPosition endPosition = new PdfDocumentPosition(1, new PdfPoint(500, 500));

    string pageText =
        processor.GetText(startPosition, endPosition, new PdfTextExtractionOptions { ClipToCropBox = false });
    Console.WriteLine(pageText);
}

VB.NET

Using processor As New DevExpress.Pdf.PdfDocumentProcessor()
    processor.LoadDocument("TextExtraction.pdf")
    Dim startPosition As New PdfDocumentPosition(1, New PdfPoint(0, 0))
    Dim endPosition As New PdfDocumentPosition(1, New PdfPoint(500, 500))

    Dim pageText As String =
        processor.GetText(startPosition, endPosition, New PdfTextExtractionOptions With {.ClipToCropBox = False})
    Console.WriteLine(pageText)
End Using
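
The PdfDocumentArea alternative mentioned earlier in this section could look like the sketch below. It assumes that PdfDocumentArea is constructed from a page number and a PdfRectangle, that PdfRectangle takes left, bottom, right, and top values in page coordinates, and that GetText has an overload accepting the area together with extraction options. Verify these signatures against the API reference for your version. The rectangle values are arbitrary illustration values.

```csharp
using System;
using DevExpress.Pdf;

class AreaTextExample {
    static void Main() {
        using (PdfDocumentProcessor processor = new PdfDocumentProcessor()) {
            processor.LoadDocument("TextExtraction.pdf");

            // Assumed argument order: left, bottom, right, top (page coordinates).
            PdfRectangle rectangle = new PdfRectangle(0, 0, 500, 500);

            // Assumed constructor: page number (1 is the first page) and rectangle.
            PdfDocumentArea area = new PdfDocumentArea(1, rectangle);

            string areaText = processor.GetText(
                area, new PdfTextExtractionOptions { ClipToCropBox = false });
            Console.WriteLine(areaText);
        }
    }
}
```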
