Background
Our system does a ‘review’ of messages after our operators save them. It checks for things like fields not being filled in where they normally are, but most importantly it checks for spelling mistakes and typos.
We used to use the Telerik RadSpell spell checker component in a back-end web service. It worked adequately, but it had a limited dictionary, didn't know the Queen's English (it uses American spellings) and the suggested corrections were often a bit… random, as you can see from the screenshot below. The word column contains the supplied misspelling and the subsequent columns are the suggestions (in order).
How does an average spell checker work?
It's pretty simple to make a crude spell checker, and all you need is a dictionary of correct words. You take each word and check whether it exists in the dictionary. If not, you loop through the dictionary seeing how different each correct word is from the supplied misspelled word. There's a well-used algorithm for measuring how different two words are: the Levenshtein distance, or 'edit distance'. Each insertion/deletion/substitution counts as an 'edit'. For instance, take the misspelling 'hosspitle':
hosspitle -> hospitle
hospitle -> hospitae
hospitae -> hospital
That's an edit distance of 3. The lower the Levenshtein distance, the more alike the words are.
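To make that concrete, here's a minimal Python sketch of the standard dynamic-programming edit distance (this is just an illustration, not the production code):

def levenshtein(a: str, b: str) -> int:
    """Levenshtein (edit) distance: the minimum number of insertions,
    deletions and substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(levenshtein("hosspitle", "hospital"))  # 3, matching the example above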
There's a slight snag with doing it this way though. If you have even a small dictionary of, say, 10,000 words, you'd need to compare each of those 10,000 words to your misspelling. There's no real way of pre-computing it as you can't possibly cater for all misspellings, so it'd be quite a costly computational exercise. We can get a much, much smaller subset of words to compare by selecting them with a phonetic algorithm, the most common of which is Soundex. This way we can pre-compute the Soundex code for all of our known good words.
For example, you can get the Soundex value for 'hosspitle' in SQL Server 2008 by doing select soundex('hosspitle'). This gives a value of H213. If I check the Soundex of 'hospital' I also get H213, which means the correct result would be in the subset, which is a good start!
Spell Check 2.0
Why?
Because crappy-looking messages with spelling mistakes and typos don't give the client a sense of professionalism. The previous spelling corrector cried wolf a bit too often, and I found that a lot of operators had got used to ignoring it. Also, if it didn't list the correct suggestion first time around, it took them a while to go back into the message, correct the word and resend… so there was a temptation to just ignore the mistake and send it anyway.
Improvement #1 – Junking the Telerik engine
The first step is to reproduce what the Telerik spell checker does so I can start to develop my own system. This turned out to be pretty easy: find a dictionary of English words on the tinterweb, upload it to SQL, create a column for a Soundex value and use the built-in SQL Soundex function to pre-compute the codes by doing "update englishwords set soundexvalue = soundex(word)".
You can then select your word subset back by doing something like "select * from englishwords where soundexvalue = soundex(@mywrongword)". Using the 'hosspitle' example, my dictionary gives me 47 records, including:
hagbut, hasped, hispid, hackbut, hagbuts, hawkbit, hexapod, hackbuts, hawkbits, hexapods, hexapody, hiccuped, hospital, hagbuteer, hagbutter, hiccupped, hispidity, hospitals, hospitium, houseboat, housebote, hospitalizing, hospitableness, hospitalization
(No, I don’t know what half of those words are either!)
Anyway, once I get the subset I order it by edit distance, so hospital appears among the first few results. This gave me exactly the same results as the Telerik engine and therefore a decent baseline to work from.
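As a rough sketch of that baseline, reusing the levenshtein() helper from the earlier snippet (the candidate list here is just a shortened version of the Soundex subset above), the ranking step amounts to:

# Take the Soundex-matched subset returned by the SQL above and order it by
# edit distance to the misspelling; the closest words come first.
candidates = ["hagbut", "hasped", "hispid", "hospital", "hexapod", "houseboat"]

def rank_by_edit_distance(wrong_word, candidates):
    return sorted(candidates, key=lambda w: levenshtein(wrong_word, w))

print(rank_by_edit_distance("hosspitle", candidates)[:3])
# 'hospital' comes out on top as it has the smallest edit distance in the subset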
Improvement #2 – Using a bigger dictionary
Bigger is better most of the time, and I needed more words. After searching the internet for a while I found some decent-sized CSVs containing lists of words along with the number of times each word had been seen (this will be useful later). I uploaded this in exactly the same way as in Improvement #1, into a table called BigDictionary, but with a field for the occurrences. The system now uses the previous English words dictionary just to check whether something is a valid word; if I don't see it there, I use the BigDictionary table to retrieve a list of possibilities.
Improvement #3 – Learn words itself
If the spelling corrector uses a fixed dictionary, it doesn't have a hope of keeping up with the modern world. For example, just looking at the 'wrong' words being flagged up by the system as it was at stage 2, I could see it was probably annoying operators: it had flagged up words such as Skype, Mercedes, Google, Ferrari, Bosch, Microsoft and Nokia. I wrote a small routine to go through two years' worth of messages, separate out each word and upload it to a table. If the word was already in the table, I incremented an 'occurrences' field (again, this will be useful later!). I set the system to gate the results so uncommon words don't appear in the suggestions, which helps to stop misspellings being learnt as valid words.
I check the BigDictionary table for suggestions, then the LearntWords table and aggregate the suggestions before sorting by edit distance.
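A minimal sketch of that learning pass, assuming a simple regex tokeniser and an in-memory count standing in for the LearntWords table (the 50-occurrence threshold is purely illustrative):

import re
from collections import Counter

def learn_words(messages, min_occurrences=50):
    """Split past messages into words and count how often each one appears.
    Gating on a minimum count keeps rare words (and therefore most
    misspellings) out of the suggestion pool."""
    counts = Counter()
    for message in messages:
        for word in re.findall(r"[a-z']+", message.lower()):
            counts[word] += 1
    return {word: n for word, n in counts.items() if n >= min_occurrences}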
Improvement #4 – Double Metaphone
The Soundex algorithm is pretty basic and it's totally reliant on the first letter being correct. This meant that the pre-computed subset was often a bit limited and wouldn't contain the correct result. After doing a bit of research into phonetic algorithms, Double Metaphone seemed like a good bet and a fair bit more advanced than Soundex. I created Primary and Secondary Metaphone fields for all of my dictionary tables so far (including the learnt words table) and made a script to calculate the primary and secondary Metaphone values for every word. After an hour of it grinding away I had pre-computed values for Metaphone as well as Soundex. I changed my SQL queries to something like select * from dictionary where (pm = @pm or sm = @sm or soundex = @soundex). This instantly made the result set bigger and it seemed to get a few more hits, particularly if the typo was early on in the word.
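For illustration, pre-computing the phonetic keys might look like the sketch below. I'm assuming a Python Double Metaphone implementation here, for example the Metaphone package whose doublemetaphone() returns a primary/secondary pair; the original system used a script against the SQL tables rather than this code.

from metaphone import doublemetaphone  # pip install Metaphone

def phonetic_keys(word):
    """Primary and secondary Double Metaphone codes for a dictionary word,
    stored alongside the Soundex value so the candidate query can match on any."""
    primary, secondary = doublemetaphone(word)
    return primary, secondary

# The candidate query then widens from Soundex-only to something like:
#   SELECT * FROM dictionary
#   WHERE pm = @pm OR sm = @sm OR soundexvalue = @soundex
print(phonetic_keys("hospital"))  # primary and secondary codes for 'hospital'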
Improvement #5 – Weight by Frequency as well as Edit Distance
If you look at the screenshot of the initial results you'll see the Telerik checker suggested 'darvon', 'driven' and 'thriven' for the typo 'dirven'. This is because it has no idea how common a word is, and it just so happens that 'darvon' has the same edit distance from 'dirven' as 'driven'. I have absolutely no idea what a darvon is, and I suspect neither would our callers. Fortunately, in the BigDictionary and LearntWords tables I have an integer field essentially telling me how common each word is. I decided against simply using the count as a multiplier for a 'relevancy', as some words are hugely more common than others and would overwhelm the edit distance. For example, if you typed 'thene' instead of 'theme', you'd find that it suggested 'the', as it's vastly more common than theme, or even them and then. Instead, I used the 'position' as the multiplier, so my SQL became something like:
Select * from dictionary where (pm = @pm or sm=@sm or soundex=@soundex) order by wordcount desc
I then take the results in and do something like:
For Each Result
    Position += 1
    Score = Position * EditDistance(Result, Word)
Next
The lower the score, the more relevant the results.
Improvement #6 – Learn from our mistakes
Looking through the log of mistakes and corrections, it seemed that the same ones were coming up again and again, for example 'Plesae' being changed to 'Please'. It's pretty obvious, but the system should look at what's been corrected for that same mistake before and bring up the correction in the results. To recap, the process we're now doing is:
- Check EnglishWords table to see if it’s a common word
- Check LearntMistakes to see if we've seen the mistake before; if so, load the corrections into an array of suggestions
- Search LearntWords by Soundex and Double Metaphone to see any soundalike word we’ve seen before in a previous message
- Search BigDictionary by Soundex and Double Metaphone to see any soundalike words that are in the dictionary
- Score all suggestions retrieved by Edit Distance and Position
Improvement #7 – Weight by source
Now that we're pulling in previous corrections, it's pretty obvious that some sources are more relevant than others. For example, if I've seen 'plesae' changed to 'please' 80 times, it's a fair bet that the next time I see 'plesae' they didn't mean 'police', 'palace', etc. So, the array of suggestions being filled from LearntMistakes, LearntWords and BigDictionary now gains a source column, and our weighting code becomes something like:
For Each Result
    Select Case Source
        Case PreviousCorrections
            SourceWeight = 10
        Case LearntWords
            SourceWeight = 15
        Case BigDictionary
            SourceWeight = 20
    End Select
    Position += 1
    Score = Position * EditDistance(Result, Word) * SourceWeight
Next
Again, the lower the weight, the higher the relevance.
Improvement #8 – Learn words by client, not just globally
Some of our clients have industry-specific words. For example, if someone phones up to book a car with Supercar Experiences and we see the typo Miserati, it's pretty likely the operator meant Maserati and not Miserable. When I processed the previously seen words from the last few years, I actually created two tables: one that was global across all clients, and one with a client code on each row, i.e. the learnt words are treated separately per company. I use a much lower threshold on this table, so the system is quicker to allow learnt words into the suggestions than it is on the global table. This is purely because any wrong words that get learnt will only appear in suggestions for that company and won't poison the global dictionaries.
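A rough sketch of the two-tier lookup, with a lower occurrence threshold on the per-client table. The table name LearntWordsByClient, the threshold values and the DB-API plumbing are illustrative assumptions rather than the real schema:

# Illustrative thresholds: the per-client table is trusted sooner, because a
# bad word learnt there can only pollute that one client's suggestions.
CLIENT_THRESHOLD = 5
GLOBAL_THRESHOLD = 50

CLIENT_QUERY = """
    SELECT word FROM LearntWordsByClient
    WHERE clientcode = ? AND occurrences >= ?
      AND (pm = ? OR sm = ? OR soundexvalue = ?)
"""

GLOBAL_QUERY = """
    SELECT word FROM LearntWords
    WHERE occurrences >= ?
      AND (pm = ? OR sm = ? OR soundexvalue = ?)
"""

def client_then_global(cursor, client_code, pm, sm, sx):
    """Collect soundalike candidates from the client's own learnt words
    (low threshold) and then from the global learnt words (higher threshold)."""
    cursor.execute(CLIENT_QUERY, (client_code, CLIENT_THRESHOLD, pm, sm, sx))
    suggestions = [row[0] for row in cursor.fetchall()]
    cursor.execute(GLOBAL_QUERY, (GLOBAL_THRESHOLD, pm, sm, sx))
    suggestions += [row[0] for row in cursor.fetchall()]
    return suggestions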
Improvement #9 – Treat transpositions differently
One snag with the Levenshtein distance algorithm is that it has no way of detecting transpositions, so 'ditsance' is an edit distance of 2 from 'distance'. Changing to the Damerau–Levenshtein distance algorithm fixes that, as an adjacent transposition counts as a single edit, and it seemed to massively improve results where the mistake was just a transposition.
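Here's a minimal sketch of that change, using the optimal string alignment variant of Damerau–Levenshtein, where swapping two adjacent letters costs a single edit (again my own illustration, not the production code):

def damerau_levenshtein(a, b):
    """Optimal string alignment distance: like Levenshtein, but an adjacent
    transposition ('ts' -> 'st') counts as one edit instead of two."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(damerau_levenshtein("ditsance", "distance"))  # 1 (it was 2 with plain Levenshtein)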
Improvement #10 – Context
This is my favourite part! By now the system is getting pretty smart and the number of messages going out with mistakes is falling rapidly (I re-analyse every message that's sent after the review process so I can count word errors), but there's still something missing, and sometimes it seems a bit woeful compared to the human brain. We're pretty good at reading typos, and half the time our brain has corrected the word without us noticing. This is because we know what word to expect. The computer, however, doesn't.
Consider the following sentence: “sending info regarding meeting she had witrh you last month”. We can see they clearly meant with, but the computer has no idea and has to evaluate it without context.
I fed the system a made-up message with words, in context, that it had previously struggled on. The message was "you itnerested in. off hlaf way. some ifno on. refused to leavr number. was looking to spk with accounts. her leter box. meeting he had witrh you last week. llease call regarding"
You can see in the screenshot below that the primary suggestion in the word2 field was pretty rotten most of the time:
What if we had a massive database of text? Luckily, we do.
I wrote a routine to go back through our previous messages and split every sentence into three word groups, so the sentence “Wanted to follow up on the meeting he had with you last week” would give us:
Word1 | Word2 | Word3 |
wanted | to | follow |
to | follow | up |
follow | up | on |
up | on | the |
on | the | meeting |
the | meeting | he |
meeting | he | had |
he | had | with |
had | with | you |
with | you | last |
you | last | week |
So there we have it… context. Whizzing through our database of past messages gave me around a million different three-word phrases. Again, I used a 'count', so if it was a common phrase such as "please call back" I'd just increment the count if it was already in the database.
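A small sketch of the phrase-harvesting routine, with an in-memory counter standing in for the ThreeWordPhrases table:

import re
from collections import Counter

def three_word_phrases(messages):
    """Break every message into overlapping three-word groups and count how
    often each group has been seen, as in the table above."""
    phrases = Counter()
    for message in messages:
        words = re.findall(r"[a-z']+", message.lower())
        for w1, w2, w3 in zip(words, words[1:], words[2:]):
            phrases[(w1, w2, w3)] += 1
    return phrases

sample = "Wanted to follow up on the meeting he had with you last week"
print(three_word_phrases([sample]).most_common(3))
# [(('wanted', 'to', 'follow'), 1), (('to', 'follow', 'up'), 1), ...]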
Then I added another stage to the spell check: finding words in context. If I came across an unknown word, I'd simply look in my table of three-word phrases using the surrounding words. For example, if I have 'please ca regarding' I'd simply search for any row where word1 = please and word3 = regarding. Here are some example results:
Please call regarding
Please email regarding
Please contact regarding
I then load all the returned middle words into my array, giving them a low weighting so they rank highly.
This context method gives the engine a much better idea of what the word could be than it had previously. Without context, the suggestion for the 'please ca' example would likely be 'can', which obviously makes no sense if the following word is 'regarding' but would make a lot of sense if word3 was 'you'.
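The lookup itself then reduces to searching on the two surrounding words. Here's a small sketch against an in-memory stand-in for the ThreeWordPhrases table (the counts are made up); in the real system this is a query on word1 and word3:

from collections import Counter

# A tiny stand-in for the ThreeWordPhrases table: (word1, word2, word3) -> count
phrases = Counter({
    ("please", "call", "regarding"): 120,
    ("please", "email", "regarding"): 40,
    ("please", "contact", "regarding"): 15,
    ("please", "can", "you"): 300,
})

def suggest_from_context(phrases, previous_word, next_word):
    """Return candidate middle words seen between the two surrounding words,
    most frequent first; these are given a low (i.e. favourable) weighting."""
    matches = [(middle, count)
               for (w1, middle, w3), count in phrases.items()
               if w1 == previous_word and w3 == next_word]
    return [middle for middle, _ in sorted(matches, key=lambda m: -m[1])]

# For the unknown word in 'please ca regarding':
print(suggest_from_context(phrases, "please", "regarding"))
# ['call', 'email', 'contact'] ('can' never appears in this context)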
This screenshot shows how much better the results are with an idea of context:
Stage #11 – Always learning
It goes without saying really, but the system continuously learns words and three-word phrases from each new message.
Stage #12 – Wrong words
The danger with the system learning is that it could learn wrong words. I have a block process: once a week I check for any words it has learnt that are above or near the inclusion thresholds to appear in the results. With a single click I can either delete the word from the tables, or delete it and block it from ever being learnt again by adding it to a BlackListedWords table.
Summary
The process we’re now doing is:
- Check EnglishWords table to see if it’s a common word
- Check LearntMistakes to see if we've seen the mistake before; if so, load the corrections into an array of suggestions
- Check ThreeWordPhrases using context to see what the word could be
- Search LearntWords by Soundex and Double Metaphone to see any soundalike word we’ve seen before in a previous message for this client
- Search LearntWords by Soundex and Double Metaphone to see any soundalike word we’ve seen before in any previous message (higher threshold)
- Search BigDictionary by Soundex and Double Metaphone to see any soundalike words that are in the dictionary
- Score all suggestions retrieved by Edit Distance, Position and a Source weighting
Conclusion
Has it made any difference? Yes!!
As I mentioned before, I re-analyse every message that's sent after the review process. To make it fair, I re-analysed the past three months' worth and did some stats. The number of spelling mistakes and typos was never really very high, as we have a very strict QC policy, but in percentage terms, going on the two weeks the new system has been in place, the number of mistakes sent out to clients has dropped by 85%. It also speeds up the operators: when the suggestions were poor, it used to take them a while to correct a mistake they'd spotted.
All in all, a very worthwhile exercise and a great learning project… I ended up learning linguistics, re-learning probability and reading some 'challenging' research papers!