Spell Check Algorithms
- 4 minutes to read
Implementing a spell-checking engine is more complicated a task than it may seem. It’s evident that simply looping through the vocabulary is not enough, even if this vocabulary is quite comprehensive, and correct.The spell checker should consider the phonetic aspect of the language.
The key points of our spell checking engine are:
While parsing the text, several text elements should be treated specifically. These elements include abbreviations, proper names, figures, e-mail addresses, uniform resource locator (URL) strings (simply web addresses), and so on. They could be simply ignored or checked in a way that is different from other words in the text, depending on the spell checker implementation and user options. The ASPxSpellChecker component provides the ASPxSpellChecker.SettingsSpelling property, which returns an ASPxSpellCheckerSpellingSettings instance that allows a user to avoid checking e-mail and web addresses, words with numbers, mixed case and upper case words.
An ideal dictionary should be comprised of all the words in a given language. In real life, it can be much smaller, and effectively split into several parts, depending on the language. For several Indo-European languages, including English, words are derived from the base by adding affixes - prefixes or postfixes. So, the size of the dictionary can be greatly reduced if the base words, affixes and the rules for adding affixes to base words are placed into separate files. The complete list of words could be built in-place, when necessary. This technique proves its effectiveness especially for synthetic languages (rich in verbal and inflective forms) - Lithuanian or Russian, for example.
The approach that includes the base words and affixes is used in the ISpell and ASpell spelling checker projects. Thanks to the Open Office project, the spellchecker dictionaries of these projects may be freely used and distributed. The ASPxSpellChecker component supports this format, since those dictionaries are quite comprehensive and correct, and constantly amended by cooperative users. The current US-English variant includes more than 62000 base words.
When a word is found to be misspelled (that is, not found in the dictionary), then the spell checker generates a list of suggestions - words suggested to replace the mistake. The final choice is always up to the user.
For more information, see Dictionaries.
Using Near-Miss Strategy to Find Suggestions
The first algorithm implemented by ASPxSpellChecker for building a suggestion list is a near miss strategy. It was developed by Geoff Kuenning for ISpell, and makes an assumption that the word is not necessarily misspelled, but rather mistyped. We change the misspelled word by changing a letter, deleting or adding it, inserting a blank space, or interchanging two adjacent letters. If these steps result in a word contained in the dictionary, then we estimate how far we are from the original word. To measure the proximity of words, the modified Levenshtein distance notion is used.
Using Phonetic Comparison to Find Suggestions
The phonetic suggestion algorithm takes into account the pronunciation of a word. The ASPxSpellChecker component utilizes the implementation of the Double Metaphone search algorithm. Two phonetic codes (primary and secondary) are calculated for each word. The calculation rules are different for different languages. They are based on the set of pronunciation rules for that language.
Then, the phonetic strategy compares the phonetic code of the misspelled word to all the words in the word list. If the phonetic codes match, then the word is added to the suggestion list.
After the list of suggestions is composed, it should be ordered so that the user doesn’t have to scroll through it searching for a perfect match. The implemented solution makes use of the Levenshtein algorithm to calculate the word distance. This distance becomes a parameter for list ordering. Additional assumptions on the nature of a spelling error may help modify the algorithm.
The user makes his choice from the list of suggestions. The misspelled word can be replaced with a word from the suggestion list, ignored, or edited by the user. The last possibility indicates a spell checker miss, and provides an option for appending the corrected word to an auxiliary user dictionary.