How to: Get Coordinates of All Words in a Document
The code sample below shows how to use the PdfDocumentProcessor.NextWord method to iterate through all words in a document and retrieve their coordinates.
The PdfDocumentProcessor.NextWord method returns a PdfPageWord object, or null when there are no more words to read. The Rectangles property returns the list of rectangles that encompass the current word.
Tip
The Rectangles property returns more than one PdfOrientedRectangle object when a part of a word is carried over to the next line. Use the Segments property to obtain information about each part of the word.
using System;
using System.Collections.Generic;
using DevExpress.Pdf;

// Declare a list to store each word segment and its coordinates
List<Tuple<string, PdfOrientedRectangle>> WordCoordinates = new List<Tuple<string, PdfOrientedRectangle>>();
using (PdfDocumentProcessor processor = new PdfDocumentProcessor())
{
    processor.LoadDocument("Document.pdf");
    PdfPageWord currentWord = processor.NextWord();
    while (currentWord != null)
    {
        for (int i = 0; i < currentWord.Rectangles.Count; i++)
        {
            // Retrieve the number of the page on which the word
            // is located:
            int pageNumber = currentWord.PageNumber;
            // Retrieve the rectangle encompassing this part of the word
            var wordRectangle = currentWord.Rectangles[i];
            // Add the segment's content and its coordinates to the list
            WordCoordinates.Add(new Tuple<string, PdfOrientedRectangle>(currentWord.Segments[i].Text, wordRectangle));
        }
        // Switch to the next word
        currentWord = processor.NextWord();
    }
}
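The sample above pairs Rectangles[i] with Segments[i] by index. As an alternative, the sketch below walks the Segments collection directly and prints each segment's coordinates to the console. It is a minimal illustration, not the library's prescribed approach. It assumes that each PdfWordSegment exposes its own Rectangle and Text properties and that PdfOrientedRectangle exposes Left, Top, Width, Height, and Angle. "Document.pdf" is the same placeholder path as above.

```csharp
using System;
using DevExpress.Pdf;

class Program
{
    static void Main()
    {
        using (PdfDocumentProcessor processor = new PdfDocumentProcessor())
        {
            // Placeholder path; replace with a real PDF file
            processor.LoadDocument("Document.pdf");

            PdfPageWord word = processor.NextWord();
            while (word != null)
            {
                // Assumption: each segment carries its own text and rectangle,
                // so no index is needed to pair them
                foreach (PdfWordSegment segment in word.Segments)
                {
                    PdfOrientedRectangle r = segment.Rectangle;
                    Console.WriteLine(
                        "Page {0}: \"{1}\" left={2:F2}, top={3:F2}, width={4:F2}, height={5:F2}, angle={6:F2}",
                        word.PageNumber, segment.Text,
                        r.Left, r.Top, r.Width, r.Height, r.Angle);
                }

                // Switch to the next word
                word = processor.NextWord();
            }
        }
    }
}
```

Iterating over segments avoids relying on the two collections having matching lengths, and it keeps the page number next to each printed entry.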