New PDF Document API: Find, Edit, Redact or Remove Text
The new PDF Document API allows you to find text in PDF documents and edit the search results. The following actions are available:
- Find text in a PDF document
- Format search results (change font, color, and so on)
- Remove or redact search results
Basic Concept: Search Result Structure
Call the PdfDocument.FindText method to start a search. You can specify the page range and search settings (whole words, case sensitivity, and so on).
The FindText method returns a list of TextSearchInfo objects. Each object contains the results found on a single page and exposes two lists:
- Matches: Use this list to obtain the coordinates (rectangles) of search results on the page. You can use these rectangles to redact or draw over search results.
- Groups: Use this list to format search results. It gives you access to the matching text fragments in the document.
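The redaction snippet later in this topic reads several rectangles for one match (one per matched fragment). If you need a single bounding rectangle instead, for example to draw one highlight over the whole match, the rectangles can be merged. The following is a minimal sketch, assuming the rectangles are System.Drawing.RectangleF values; the MatchBounds class and its Union method are hypothetical helper names and are not part of the PDF Document API:

```csharp
using System;
using System.Collections.Generic;
using System.Drawing;

// Hypothetical helper (not part of DevExpress.Docs.Pdf).
public static class MatchBounds
{
    // Returns the smallest rectangle that contains every input rectangle,
    // or RectangleF.Empty if the sequence has no elements.
    public static RectangleF Union(IEnumerable<RectangleF> rectangles)
    {
        if (rectangles == null)
            throw new ArgumentNullException(nameof(rectangles));

        using (IEnumerator<RectangleF> enumerator = rectangles.GetEnumerator())
        {
            if (!enumerator.MoveNext())
                return RectangleF.Empty;

            // Start from the first rectangle and grow the bounds as needed.
            RectangleF bounds = enumerator.Current;
            while (enumerator.MoveNext())
                bounds = RectangleF.Union(bounds, enumerator.Current);
            return bounds;
        }
    }
}
```

For a match, you could pass the fragment rectangles to this helper, for example MatchBounds.Union(match.MatchFragments.Select(f => f.Rectangle)) with System.Linq imported, provided that the Rectangle property returns RectangleF as assumed above.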
Find Text
The following code snippet searches for the word “DevExpress” on the first three pages of a PDF document:
using DevExpress.Docs.Pdf;
using System.Collections.Generic;
using System.IO;
using (FileStream fileStream =
    File.OpenRead(@"Document.pdf"))
{
    using (PdfDocument pdfDocument = new PdfDocument(fileStream))
    {
        // Search for text in the first three pages (page indexes 0 through 2).
        IEnumerable<TextSearchInfo> results =
            pdfDocument.FindText("DevExpress",
                new TextSearchOptions(true, true), 0, 2);
        foreach (TextSearchInfo searchResult in results)
        {
            // Process search results.
        }
        // File.Create truncates an existing file; the using block closes the stream.
        using (FileStream output =
            File.Create(@"Document_upd.pdf"))
        {
            pdfDocument.Save(output);
        }
    }
}
Format Search Results
Use the TextSearchInfo.Groups property to get the collection of text groups that match the search pattern. Each TextMatchGroup object contains a Fragment property that returns a TextFragment object. Use the fragment’s properties to change text formatting.
The following code snippet changes the font color of matching text fragments:
using DevExpress.Docs.Pdf;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using (FileStream fileStream =
    File.OpenRead(@"Document.pdf"))
{
    using (PdfDocument pdfDocument =
        new PdfDocument(fileStream))
    {
        // Search for text in the first three pages (page indexes 0 through 2).
        IEnumerable<TextSearchInfo> results =
            pdfDocument.FindText("DevExpress",
                new TextSearchOptions(true, true), 0, 2);
        foreach (TextSearchInfo searchResult in results)
        {
            foreach (var group in searchResult.Groups)
            {
                // Split the fragment into matched and unmatched parts.
                var newFragments = group.Fragment
                    .Split(group,
                        out TextFragment[] matched,
                        out TextFragment[] notMatched);
                // Apply formatting to matched fragments.
                foreach (var matchedFragment in matched)
                {
                    matchedFragment.Bold = true;
                    matchedFragment.ForeColor = Color.Red;
                }
                // Replace the original fragment with the split fragments.
                pdfDocument.Pages[searchResult.PageIndex]
                    .Fragments.Replace(
                        group.Fragment, newFragments);
            }
        }
        // Save the result; the using block closes the output stream.
        using (FileStream output = new FileStream(
            "Result.pdf", FileMode.Create,
            FileAccess.Write))
        {
            pdfDocument.Save(output);
        }
    }
}
Redact Search Results
Use redaction annotations to redact search results. Create a RedactionAnnotation object and specify the text fragment’s rectangle as the annotation’s bounds. Call the PdfDocument.ApplyRedaction method to apply redaction annotations to a PDF document. Once the redaction annotation is applied, it cannot be retrieved, edited, or removed.
You can call the Page.Annotations.Add method to add a redaction annotation without applying it.
The following code snippet redacts search results:
using DevExpress.Docs.Pdf;
using System;
using System.Collections.Generic;
using System.Drawing;
using System.IO;
using System.Linq;
// Sample settings for this snippet (placeholder values).
string redactionText = "REDACTED";
bool applyRedaction = true;
// Load the PDF document.
using (FileStream inputStream = File.OpenRead(@"Document.pdf"))
using (PdfDocument pdfDocument = new PdfDocument(inputStream))
{
    // Search for text in the first three pages (page indexes 0 through 2).
    IEnumerable<TextSearchInfo> results =
        pdfDocument.FindText("DevExpress",
            new TextSearchOptions(true, true), 0, 2);
    // Iterate through search results (one TextSearchInfo per page).
    foreach (var result in results) {
        List<RedactionAnnotation> annotations = new List<RedactionAnnotation>();
        foreach (var match in result.Matches) {
            // Get rectangles of matched text fragments.
            var rectangles = match.MatchFragments
                .Select(f => f.Rectangle).ToArray();
            // Create a redaction annotation with specified settings.
            var annotation = new RedactionAnnotation(RectangleF.Empty) {
                Geometry = new RedactionGeometry(rectangles),
                FillColor = PdfColor.Black,
                Color = PdfColor.Red,
                OverlayText = string.IsNullOrEmpty(redactionText) ? null : redactionText,
                CreationDate = DateTime.UtcNow,
                TextJustification = TextJustification.LeftJustified,
                RepeatText = true,
                TextAppearance = new TextAppearance() {
                    FontSize = 0,
                    Fill = SolidFill.White,
                }
            };
            annotations.Add(annotation);
        }
        // Apply redaction or add annotations to the page.
        if (applyRedaction)
            pdfDocument.ApplyRedaction(result.PageIndex, annotations.ToArray());
        else
            foreach (var ann in annotations)
                pdfDocument.Pages[result.PageIndex].Annotations.Add(ann);
    }
    // Save the result document.
    using (FileStream fileStream = new FileStream(
        "Result.pdf", FileMode.Create,
        FileAccess.Write)) {
        pdfDocument.Save(fileStream);
    }
}
Remove Found Text
The PdfDocument.RemoveText method accepts search results or a string value.
The following code snippet removes the word “DevExpress” from the document:
using DevExpress.Docs.Pdf;
using System.Collections.Generic;
using System.IO;
using (FileStream inputStream = File.OpenRead(@"Document.pdf"))
using (PdfDocument pdfDocument = new PdfDocument(inputStream))
{
    // Search for text in the first three pages (page indexes 0 through 2).
    IEnumerable<TextSearchInfo> results =
        pdfDocument.FindText("DevExpress",
            new TextSearchOptions(true, true), 0, 2);
    // Remove the found text.
    pdfDocument.RemoveText(results);
    // Save the result; the using block closes the output stream.
    using (FileStream output = new FileStream(
        @"C:\Test Documents\Result.pdf",
        FileMode.Create, FileAccess.Write))
    {
        pdfDocument.Save(output);
    }
}