How to: Get Coordinates of All Words in a Document

The code sample below shows how to use the PdfDocumentProcessor.NextWord method to iterate over all words in a document and retrieve their coordinates.

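The PdfDocumentProcessor.NextWord method returns a PdfPageWord object, or null when the document contains no more words. The Rectangles property returns a list of PdfOrientedRectangle objects that encompass the current word. The list usually contains a single rectangle.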

Tip

The Rectangles property returns more than one PdfOrientedRectangle object when a part of a word is carried over to the next line. Use the Segments property to obtain information about each part of the word.

using System;
using System.Collections.Generic;
using DevExpress.Pdf;

// Declare a list to store each word segment and its coordinates
List<Tuple<string, PdfOrientedRectangle>> WordCoordinates = new List<Tuple<string, PdfOrientedRectangle>>();
using (PdfDocumentProcessor processor = new PdfDocumentProcessor())
{
    processor.LoadDocument("Document.pdf");
    PdfPageWord currentWord = processor.NextWord();
    // NextWord returns null after the last word in the document
    while (currentWord != null)
    {
        // Retrieve the number of the page on which the word is located
        // (not stored in the list below; shown for reference)
        int pageNumber = currentWord.PageNumber;

        // A word carried over to the next line has one rectangle per segment
        for (int i = 0; i < currentWord.Rectangles.Count; i++)
        {
            // Retrieve the rectangle encompassing the current segment
            var wordRectangle = currentWord.Rectangles[i];

            // Add the segment's content and its coordinates to the list
            WordCoordinates.Add(new Tuple<string, PdfOrientedRectangle>(currentWord.Segments[i].Text, wordRectangle));
        }
        // Switch to the next word
        currentWord = processor.NextWord();
    }
}
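
Once the list is populated, its entries can be inspected. The sketch below is one possible way to print each stored segment together with its rectangle. WordCoordinatePrinter is a hypothetical helper, not part of the DevExpress API, and the Left, Top, Width, Height, and Angle property names on PdfOrientedRectangle are assumed here, so check them against the API reference for your installed version.

```csharp
using System;
using System.Collections.Generic;
using DevExpress.Pdf;

// Hypothetical helper for illustration only; not part of the DevExpress API.
static class WordCoordinatePrinter
{
    // Writes one line per stored segment: its text followed by its rectangle.
    public static void Print(IEnumerable<Tuple<string, PdfOrientedRectangle>> wordCoordinates)
    {
        foreach (Tuple<string, PdfOrientedRectangle> entry in wordCoordinates)
        {
            PdfOrientedRectangle r = entry.Item2;
            // Property names below are assumptions; verify them for your version.
            Console.WriteLine(
                $"{entry.Item1}: left={r.Left}, top={r.Top}, " +
                $"width={r.Width}, height={r.Height}, angle={r.Angle}");
        }
    }
}
```

With the list from the sample above, the call would be WordCoordinatePrinter.Print(WordCoordinates);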