Spell-Check Algorithms
- 4 minutes to read
The main aspects of our spell-check engine implementation include:
Text analysis
When parsing the text, some text elements can be treated specifically. These elements include abbreviations, proper names, numbers, e-mail addresses, URLs (web addresses), and others. They could simply be ignored or checked in a different way from other words in the text, dependent upon the spell-checker implementation and end-user options.
The ExpressSpellChecker provides spelling options that allow an end-user to avoid checking e-mail and web addresses, words with numbers, mixed case and uppercase words.
Dictionaries used by the spell checker
ISpell dictionary. An ideal dictionary is comprised of all the words in a given language. In real life, it can be much smaller and effectively split into several parts, dependent upon the language. For several Indo-European languages, including English, words are derived from the base by adding affixes. So, the size of the dictionary can be greatly reduced if base words, affixes and rules for adding affixes to base words are placed into separate files. The complete list of words could be built in-place, when necessary. This technique proves its effectiveness especially for synthetic languages (rich in verbal and inflective forms). The approach that includes base words and affixes is used in the ISpell (ASpell) and Hunspell (OpenOffice) spell-checker projects.
Hunspell (OpenOffice) dictionary. Thanks to the Hunspell project, spell-checker dictionaries of these projects may be freely used and distributed. The ExpressSpellChecker supports this format since those dictionaries are quite complete and correct, and constantly amended by cooperative end-users.
User-defined dictionaries. These dictionaries support the plain text format, and they are mostly used as stores for words not contained in other dictionaries.
Spelling suggestion algorithms
When a word is found as misspelled (that is, not found in a dictionary), a spell-checker generates a list of suggestions – words supposed to replace the mistake. The final choice is always up to an end-user.
To provide suggested replacements for the misspelled word, the ExpressSpellChecker implementation uses the following correction algorithms:
- Near-Miss Strategy
The misspelled word is subsequently modified by changing a letter, deleting or adding it, inserting a blank space, or interchanging two adjacent letters. If these steps result in a word contained in a dictionary, then the “distance” between two words is calculated as a sum of steps required to perform the transformation.
- Replacement Table
The Replacement Table contains patterns of frequently misspelled words. With this algorithm, the search for suggested words takes less time, and results in more relevant corrections.
The Replacement Table algorithm has a higher priority than the Near-Miss Strategy algorithm. As a result, the distance between the misspelled and suggested words is always 1.
- Phonetic Comparison
The phonetic suggestion algorithm takes into account the pronunciation of a word. It compares the phonetic code of the misspelled word to all the words in the word list. If the phonetic codes match, then the word is added to the suggestion list.
The ExpressSpellChecker utilizes the Double Metaphone search algorithm. This algorithm is used by the spell checker only if the following conditions are met:
The dictionary language is English.
The spell checker’s MetaphoneDistance property value is 2 or more.
No suggestions can be found using the Near-Miss Strategy and Replacement Table algorithms (the spell checker uses these algorithms first).
When composing a suggestion list with the Double Metaphone algorithm, the spell checker selects words from a dictionary based on the distance specified by the spell checker’s MetaphoneDistance property.
- N-Gram algorithm
A built-in algorithm analyzes a misspelled word using an n-gram model.
The N-Gram algorithm is used by the Hunspell dictionary. This algorithm is applied, if the dictionary’s SuggestionsLiteMode option is active, or if no suggestions can be found using the Near-Miss Strategy algorithm.
Some of the algorithms above are specific to certain formats. In different dictionaries, these algorithms are either used together or interchangeably.
Suggestion list order
After the list of suggestions is composed it needs to be ordered, so that an end-user doesn’t have to scroll through it searching for a match. The implemented solution makes use of a modified Levenshtein distance algorithm, to provide the most relevant corrections in the list.