Welcome to The Fuzzy-String Project!

Fuzzy-string processing using Damerau-Levenshtein distance, optimized for Microsoft Transact-SQL

How do you find information that was saved misspelled, or when your search is misspelled? Fuzzy-string processing! And if your information is in a database, the best place to do that processing is in the database.

This site is mostly for technical folk, but others may find it useful or instructive. For instance, if you have a word in your Scrabble® tray, Search it to find out what extra letter you could link it with to make a new word! (Of course, other sites can provide that service more quickly, because they have pre-calculated all the information. My intent here is to prove that you can get adequate performance without caching any information.)

This site uses an optimized Transact-SQL form of the Damerau-Levenshtein distance (DLD) algorithm for calculating the difference between strings. Technical details and SQL source code can be found at SQLServerCentral (and will be posted on this site as well, starting 2012-12-17.)

Essentially, DLD is the smallest number of “Changes” required to “Transform” one string into another. Changes can be Inserts (ac→abc), Deletes (abc→ac), Substitutions (abc→adc), or Transpositions (abcd→acbd but not ababab→bababa) of letters. (Changes may be referred to in the code, or on other sites, as Edits, Operations, Typos, Differences, or Cost.)

DLD does not take into account:

Phonetic similarity (words that sound similar, like threw and through)
Tense and pluralization, like was and were or mice and mouse)
Semantic similarity (words that have similar meaning, like car and auto)
Anagrams (words made with the same letters in a different order, like straw and warts)
Nicknames and abbreviations (like Tom and Thomas or Co and Company)
Colloquialisms and slang (like Chevy and Chevrolet or shiv and knife)
Syllabic similarity (parts of word rearranged, like throwback and backthrow)

See the Reference page for other algorithms that may use those kinds of string similarity. However, note that if you use a “Change Limit” (also known as a Threshold) to stop calculation when strings are too different, DLD often needs to only look at the first few letters to discard a potential match, so it will usually be much faster than other algorithms. DLD also has the speed advantage of using only integer addition and subtraction in calculation.

What DLD excels at is finding accidental typos, such as those caused by:

Replacing or adding a letter due to proximity on the keyboard (like retuen and return or treturn and return)
Striking keys in the wrong order due to typing too fast (like retrun and return)
Dropping or doubling a letter due to a malfunctioning key (like capraio and capradio or returrn and return)
Mishearing, misspelling, and words/names with common variations (like color and colour or Stephen and Stephan)

DLD can be especially effective if your database has an alphanumeric key field that users have to type, such as a Vehicle Identification Number (VIN). You can imagine the potential for error when dealing with strings like “4S3BH675017600337” - especially when they can be hard to read because a vehicle is damaged, worn or dirty!

Of course, no matter how optimized, DLD is still slower than basic equality or LIKE pattern-matching. So suggested practice is to use your existing search feature first; then when the user starts entering a new record because they didn't find exactly what they were looking for, launch a DLD search in the background. That way, the user does not need to wait for the DLD search to complete before starting entry, but they can be notified of a potential match before they finish entry and save a new record (which you would have to de-duplicate later).