Visual String Transformation using Damerau-Levenshtein distance in T-SQL
The fuzzy-string matrix below shows one of the simplest ways to transform
String1
“bababa”
(rows)
to
String2
“ababab”
(columns),
with
Change Limit:
[None]
[1]
[2]
[3]
[4]
[5]
Switch!
2 Changes are required, so
2 is the
Distance between the strings, which is 33.33% of the longer length.
(A large percentage means the strings are very different;
a small distance means the strings are very similar. More explanation follows the matrix;
technical details and SQL source code can be found at SQLServerCentral.)
Visually follow the simplest path of Changes from top-left to bottom-right in bold.
Each cell displays the cumulative number of Changes required to arrive there,
and what type of Change is required to arrive there from the previous cell, as follows:
- Equality (e): Move diagonally down right, because there is no Change (the letters from String1 and String2 match).
- Substitute (s): Move diagonally down right, by replacing the letter from String1 with the different letter from String2.
- Delete (d): Move down from the previous row, by removing the letter from String1.
- Insert (i): Move right from the previous column, by adding the letter from String2.
- Transpose (t): Move diagonally down and right twice as one Change, by swapping the letter from String1 with the next one.
Calculation of Changes starts from the top-left cell.
(You can hover your mouse pointer over each cell to see the calculations.)
To produce the numbers in each cell,
the first (0th) row and column are initialized with the positions of the letters in the strings.
For each cell after that, the number of Changes required to arrive there is calculated as the minimum of:
- Equality (e): The Changes from the cell diagonally above left (if the letters from String1 and String2 match).
- Substitute (s): The Changes from the cell diagonally above left, plus one (if the letters from String1 and String2 are different).
- Delete (d): The Changes from the cell above, plus one.
- Insert (i): The Changes from the cell to the left, plus one.
- Transpose (t): The Changes from the cell diagonally above left twice, plus one (if the current letters are swapped with the cell above left).
Upon arrival at the bottom-right cell, the total number of Changes is known.
If that number is all you are interested in, you can stop calculation.
But when you want to see the simplest path,
it is determined by working backwards from bottom-right to top-left,
following the direction of the minimum Change in each cell.
(If the top or left edge is reached before the top-left cell,
then the type of Change in the remaining cells is overwritten,
with Inserts or Deletes respectively.)
Shortcuts are available, if you want to stop comparison when the strings are determined to be “too different”
(that is, after more than a certain number of Changes have been identified). You can visualize these by selecting a Change Limit above:
- At any point in the comparison, the best case is that the rest of the strings are identical,
which would mean traveling down the diagonal ending at bottom-right.
Therefore, you can stop at the first of those cells (highlighted silver) that has too many Changes (in red text).
- Once the simplest Change path has diverged too much from the diagonal ending at bottom-right, only more Changes can bring it back.
Therefore, you need only calculate the cells in a stripe within your desired Change Limit
to either side of the diagonal.
The Changes in the uncalculated cells directly above and to the left of the calculated area can be assumed to be the Change Limit plus one.
If the strings are different lengths, the stripe is narrowed by that difference,
which is the distance between the diagonal starting at top-left (highlighted white)
and the diagonal ending at bottom-right,
because the optimal path must stay within the Change Limit of both diagonals.
- Comparison is guaranteed to stop, so there is no need to calculate at all,
if the Change Limit is less than the difference in lengths between the strings.
Also, comparison is guaranteed to succeed, so there is no need to calculate at all,
if the Change Limit is not less than the length of the longer string
(because then every letter could be different).
Caveat: There may be other equally simple ways (same number of Changes) to transform String1 to String2, but no simpler way.
Except, because this is the “direct” Damerau extension of the Levenshtein algorithm (allowing each substring to be modified only once), there may be a simpler way in the following scenarios:
- Transpose two letters, and then insert a sequence of letters between them (TO→OST theoretically could be just 2 Changes, but this implementation will report 3).
- Delete a sequence of letters, and then transpose the two letters that have become adjacent (OST→TO theoretically could be just 2 Changes, but this implementation will report 3).
(We are willing to accept that limitation, because those scenarios are unlikely to be accidental typographical errors, and the intent here is to count unintentional human errors.)
Note also that since only single letters are compared, Unicode Canonical Equivalence of multi-character sequences cannot be considered here
(strings must be normalized before comparison).
Although this site uses a binary collation for comparison (for complete case-sensitivity and accent-sensitivity),
you can choose to implement this code with a different collation.