Mark Mintoff My superpower is common sense

12Feb/113

Damerau – Levenshtein String Comparison

Fuzzy logic is something which I dabbled into during my stint at Computime, for reasons of which you'll pardon me for not publicizing and it has been a long time since I've had the pleasure of working with Levenshtein Distance algorithms for fuzzy string comparison. What is Levenshtein Distance? The long and short of it is that it is a methodology for measuring the number of differences between two strings. This counts for deletions, insertions and substitutions. An adaptation by Damerau takes into consideration transpositions (writing 'teh' instead of 'the') as a single difference, rather than two as the original Levenshtein algorithm specifies. My adaptation of the DamerauLevenshtein algorithm: What this method will do is allow you to compare two strings to determine the amount of differences there are between them. But an integer value representing changes is not really sufficient for converting this into a percentage threshold to be used for Fuzzy Searches. And thus, I implemented a String Extension; What this does is allow you to check whether a string is 60%+ similar to another string. The math behind this is pretty simple; The Levenshtein Distance is divided by the greatest length of the two strings. This produces the percentage difference. It is then subtracted from 1.0 to produce the similarity. This can of course be substituted for: if you want to save on a single mathematical operation. For the sake of clarity and readability however, I am sticking to the representation of similarity. I ran a test using a predefined set of words to compare against;
  • Behaviour
  • Projection
  • Transposition
  • Rebellious
  • Precipitation
  • Precipice
  • Kleptomaniac
  • Serotonin
  • Bicycle
  • Queasy
  • Flabbergasted
  • Destroyers
If I compare using the word "behavio" I get "Behaviour". If I compare using the word "precipit" I get "Precipitation" and "Precipice". If I compare using the word  "reballous" I get "Rebellious". If I compare using the word "projekt" I get "Projection". If I compare using the word "rebel" I get nothing. Why? Rebellious is a 10 letter word. By leaving out "lious" from the search, I am only searching for half the word. Naturally this results in a 50% similarity. Since I defined that the word has to be 60%+ similar, the word is not considered to be similar. And there you have it; an algorithm for use in string comparisons and a string extension to compare two strings for similarity. Happy coding!
VN:F [1.9.22_1171]
Rating: 5.0/5 (2 votes cast)