What is Spell correction ?
Spelling correction is the process of detecting and providing the suggestion for misspelled words in text. The process consists of two distinct tasks — detecting and correcting the error. We will be discussing some of the NLP techniques that can be used to achieve spell correction.
Types of Spell Errors
There are different types of spelling errors. Let’s try to classify them a bit formally.
1.Non-word Errors: These are the very common types of errors. You either miss a few alphabets or let your fingers hurtle a bit longer. E.g., typing compilation when you meant completion; or off when you meant of.
2.Real Word Errors: Sometimes instead of creating a non-word, you end up creating a real word, but one you didn’t intend. E.g, typing compete when you meant complete, or your fingers are a tad wonky, and you type in three when you mean there.
3. Cognitive Errors: The previous two types of errors result not from ignorance of a word or its correct spelling. Cognitive errors can occur due to these factors.
The words idle and ideal are homophones (sound the same). So you are not sure which one you have to write. Sometimes you’re damn sure about your spellings despite a little grammar you don’t know.
4. Short forms/Slang/Lingo: These are possibly not even spelling errors. Maybe you are just trying to say gnsdtc or you are trying hard to fit in everything within a text message or a tweet and must commit a spelling sin. We mention them here for completeness.
Error detection represents a set of behavior which spellers use internally to monitor and review ongoing processes while spelling words. Error detection may also refer to overt scanning of previously written words to determine if errors are present.
On-line bibliographic search systems tend to extend the visibility of spelling errors through the utilization of indexes of unique terms; even low error rates during a database may result in a large number of misspelled terms in these indexes.
A computer program for spelling error detection achieved a high level of performance using hashing techniques for dictionary look-up and compression.
Error detection techniques:
1.Dictionary Lookup Technique-
In this, Dictionary lookup technique is employed which checks every word of input text for its presence in the dictionary. If that word is present in the dictionary, then it is a correct word. Otherwise it’s put into the list of error words. The use of most common technique for gaining fast access to a dictionary is the Hash Table. To look up an input string, one simply computes its hash addresses and retrieves the word stored at that address within the pre- constructed hash table. If the word stored at the hash address is different from the input string or is null, a misspelling is can occur.
2.N-gram Analysis Technique-
The N-grams are n-letter sub sequences of words or strings where n usually is one, two or three. One letter n- grams are referred to as unigrams or monograms; two letter n-grams are referred to as bi-grams and three letter n-grams as trigrams. In general, n-gram detection technique work by examining each n-gram is an input string and looking it up in a precompiled table of n-gram statistics to as certain either its existence or its frequency of words or strings that are found to contain non- existence or highly infrequent n-grams are identified as either misspellings.
Without context Correction
Spelling Correction without Context is also known as isolated-term correction. In this method, we aim to correct one query at a time even when we are dealing with a multiple-term query. For example, we may also want to retrieve documents containing a term ‘berth’ when the user types the query ‘birth’. Such an isolated-term correction would not be detected since each term in the query is spelled in isolation.
The two techniques for handling isolated-term correction are:
- Edit Distance
- K-gram Overlap
Edit Distance -
Among various alternatives for correct spellings of a mis-spelled query, this method chooses the nearest one. This requires that we have an idea of closeness or nearness between the queries.
The edit distance between two words, w1 and w2, is the lowest number of edit operations that are needed to convert w1 into w2. Generally, the edit operations are:
- Insert a character
- Remove a character
- Replace a character by another character
For example, the edit distance between ‘dig’ and ‘dog’ is 1 and that between ‘January’ and ‘February’ is 4.
K-gram Overlap -
At the point when two correctly spelled queries are tied (or almost tied), we select the one that is more common.
Here, K-grams are k-length subsequences of a string. Thus, the k-gram overlap is used to find the most likely spelling corrections.
For example, ‘bolder’ and ‘boulder’ both seem equally probable corrections for ‘boder’.
The algorithm then selects the more common of ‘bolder’ and ‘boulder’ as the correction.
The number of occurrences of the word is taken into consideration to determine the same.
Language Model — Language modelling is task of predicting the probabilities of a sequence of words i.e. sentences in a language.
Types of Language Model -
- Statistical — It uses statistical techniques to understand probability distribution of words
- Neural — It uses different neural networks
We would be covering n-gram under statistical Language Model.
n-gram : It is the method which uses sequence of n words.
Consider an eg- ‘There is a car’
Here ‘There’, ‘is’, ‘a’, ‘car’ are one word sequences which are known as unigram. Similarly if we consider the sequence of two words it’s termed bigram and so on.
Probability of sentence using chain rule of probability is given by-
The probability can be calculated by relative frequency count i.e.
The better formal way of defining probability is-
This is the product of the number of conditional probability known as joint probability.
Instead of considering the entire history we can just use the last few words to approximate the probability. Consider for a bigram,
Markov Assumption — The probability of a word depends only on the previous word.
Issue- The issue with n-gram model is that everything we compute is in log space in case of probability but when we multiply more the probabilities the product becomes smaller that results in underflow. Also adding in log space is equivalent to multiplying in linear space and addition is faster than multiplication. Handling of zero is also an issue.
The zero can be handled by smoothing. The ways of smoothing are Laplace (add one) smoothing, add-k smoothing, etc.
Aditi Gangurde, Akanksha Shrikhande, Srushti Potnurwar
Spelling Detection Errors Techniques in NLP: A Survey — By Rasha Altarawneh
Towards the Natural Language Processing as Spelling Correction for Offline Handwritten Text Recognition Systems — Arthur Flor de Sousa Neto, Byron Leite Dantas Bezerra, and Alejandro Héctor Toselli
Auto Spelling Checker using Natural Language Processing — Chinmay Patil, Rexson Rodrigues, Reuben Ron