Problem: Given two strings of size m, n and set of operations replace (R), insert (I) and delete (D) all at equal cost. Find minimum number of edits (operations) required to convert one string into another.
Identifying Recursive Methods:
What will be sub-problem in this case? Consider finding edit distance of part of the strings, say small prefix. Let us denote them as [1...i] and [1...j] for some 1< i < m and 1 < j < n. Clearly it is solving smaller instance of final problem, denote it as E(i, j). Our goal is finding E(m, n) and minimizing the cost.
In the prefix, we can right align the strings in three ways (i, -), (-, j) and (i, j). The hyphen symbol (-) representing no character. An example can make it more clear.
Given strings SUNDAY and SATURDAY. We want to convert SUNDAY into SATURDAY with minimum edits. Let us pick i = 2 and j = 4 i.e. prefix strings are SUN and SATU respectively (assume the strings indices start at 1). The right most characters can be aligned in three different ways.
Case 1: Align characters U and U. They are equal, no edit is required. We still left with the problem of i = 1 and j = 3, E(i-1, j-1).
Case 2: Align right character from first string and no character from second string. We need a deletion (D) here. We still left with problem of i = 1 and j = 4, E(i-1, j).
Case 3: Align right character from second string and no character from first string. We need an insertion (I) here. We still left with problem of i = 2 and j = 3, E(i, j-1).
Combining all the subproblems minimum cost of aligning prefix strings ending at i and j given by
E(i, j) = min( [E(i-1, j) + D], [E(i, j-1) + I], [E(i-1, j-1) + R if i,j characters are not same] )
We still not yet done. What will be base case(s)?
When both of the strings are of size 0, the cost is 0. When only one of the string is zero, we need edit operations as that of non-zero length string. Mathematically,
E(0, 0) = 0, E(i, 0) = i, E(0, j) = j
Now it is easy to complete recursive method. Go through the code for recursive algorithm (edit_distance_recursive).
Dynamic Programming Method:
We can calculate the complexity of recursive expression fairly easily.
T(m, n) = T(m-1, n-1) + T(m, n-1) + T(m-1, n) + C
The complexity of T(m, n) can be calculated by successive substitution method or solving homogeneous equation of two variables. It will result in an exponential complexity algorithm. It is evident from the recursion tree that it will be solving subproblems again and again. Few strings result in many overlapping subproblems (try the below program with strings exponential and polynomial and note the delay in recursive method).
We can tabulate the repeating subproblems and look them up when required next time (bottom up). A two dimensional array formed by the strings can keep track of the minimum cost till the current character comparison. The visualization code will help in understanding the construction of matrix.
The time complexity of dynamic programming method is O(mn) as we need to construct the table fully. The space complexity is also O(mn). If we need only the cost of edit, we just need O(min(m, n)) space as it is required only to keep track of the current row and previous row.
Usually the costs D, I and R are not same. In such case the problem can be represented as an acyclic directed graph (DAG) with weights on each edge, and finding shortest path gives edit distance.
Applications:
There are many practical applications of edit distance algorithm, refer Lucene API for sample. Another example, display all the words in a dictionary that are near proximity to a given wordincorrectly spelled word.
package DP; import java.util.Arrays; /** * Edit Distance * * Given two words word1 and word2, find the minimum number of steps required to convert word1 to word2. (each operation is counted as 1 step.) You have the following 3 operations permitted on a word: a) Insert a character b) Delete a character c) Replace a character */ public class EditDistance { static int[][] dist = null; public static void main(String[] args) { String word1 = sdfsssjdfhsb; String word2 = cvdsfadfgkdfgj; dist = new int[word1.length()+1][word2.length()+1]; for(int[] row : dist){ Arrays.fill(row, -1); } System.out.println(minDistance(word1, word2)); System.out.println(minDistance2(word1, word2)); } public static int min3(int a, int b, int c){ return Math.min(Math.min(a, b), c); } // DP bottom-up public static int minDistance(String word1, String word2) { int[][] distance = new int[word1.length()+1][word2.length()+1]; // 邊界情況:當其中一個string為空時,只要一直添加或刪除就可以 for(int i=0; i<=word1.length(); i++){ distance[i][0] = i; } for(int j=1; j<=word2.length(); j++){ distance[0][j] = j; } // 遞推,[i][j]處可以由左,上,左上3種情況而來 for(int i=1; i<=word1.length(); i++){ for(int j=1; j<=word2.length(); j++){ distance[i][j] = min3(distance[i-1][j]+1, // 從上演變 distance[i][j-1]+1, // 從左演變 distance[i-1][j-1]+(word1.charAt(i-1)==word2.charAt(j-1) ? 0 : 1)); // 從左上演變,考慮是否需要替換 } } return distance[word1.length()][word2.length()]; // 返回右下角 } // 遞歸,太慢 public static int minDistance2(String word1, String word2) { return rec(word1, word1.length(), word2, word2.length()); // return rec2(word1, word1.length(), word2, word2.length()); } public static int rec(String word1, int len1, String word2, int len2){ if(len1 == 0){ return len2; } if(len2 == 0){ return len1; } if(word1.charAt(len1-1) == word2.charAt(len2-1)){ return rec(word1, len1-1, word2, len2-1); }else{ return min3(rec(word1, len1-1, word2, len2-1) + 1, rec(word1, len1, word2, len2-1) + 1, rec(word1, len1-1, word2, len2) + 1); } } // 添加全局數組,保存狀態,用空間換時間 DP top-down public static int rec2(String word1, int len1, String word2, int len2){ if(len1 == 0){ return len2; } if(len2 == 0){ return len1; } if(word1.charAt(len1-1) == word2.charAt(len2-1)){ if(dist[len1-1][len2-1] == -1){ dist[len1-1][len2-1] = rec2(word1, len1-1, word2, len2-1); } return dist[len1-1][len2-1]; }else{ if(dist[len1-1][len2-1] == -1){ dist[len1-1][len2-1] = rec2(word1, len1-1, word2, len2-1); } if(dist[len1][len2-1] == -1){ dist[len1][len2-1] = rec2(word1, len1, word2, len2-1); } if(dist[len1-1][len2] == -1){ dist[len1-1][len2] = rec2(word1, len1-1, word2, len2); } dist[len1][len2] = min3(dist[len1-1][len2-1]+1, dist[len1][len2-1]+1, dist[len1-1][len2]+1); return dist[len1][len2]; } } }