研究文本比較算法有一段時間了。近日研讀了《A Linear Space Algorithm for Computing Maximal Common Subsequences》(D.S.Hirschberg著)。文章寫於1975年。很多其他的論文都會引用這篇論文,可見這篇論文的質量。同時,該文作者D.S.Hirschberg也寫了很多有關LCS的文章,也都是經典中的經典。
在研讀這篇文章之後,我將它翻譯成中文。由於本人的英語與文法都還不行,故翻譯的質量也就一般了,也歡迎廣大網友指正。
Introduction
導論
The problem of finding a longest common subsequence of two strings has been solved in quadratic time and space. For stings of length 1000 (assuming coefficients of 1 microsecond and 1 byte) the solution would require 106 microseconds (one second) and 106 bytes (1000K bytes). The former is easily accommodated, the latter is not so easily obtainable. If the strings were of length of 10000, the problem might not to be solvable in main memory for lack of space.
過去求解兩個字符串的最長公共子序列的問題,需要花費二次的時間和空間。在求解兩個長1000的字符串(假定時間系數為1微秒,空間系數是1字節)過程中需要106微秒(1秒)和106字節(1000K字節)。前者很容易解決,而後者不是很容易滿足的。如果兩個字符串的長度為10000,則可能沒有足夠的主內存空間來求解這個問題。
注:文章寫於1975年,以當時的計算機的能力來看,1000K是個天量了。不過,就算是現在的計算機,如果沒有良好的算法,在大容量的文本比較時就會出問題。比方說,文本是1M的,在O(MN)的情況下,需要1T的容量。這個可是夠驚人的。
We present an algorithm which will solve this problem in quadratic time and in linear space. For Example, assuming coefficients of 2 microseconds and 10 bytes , for strings of length 1000 we would require 2 seconds and 10K bytes; for strings of length 10000 we would require a little over 3 minutes and 100K bytes.
我們提出一個解決該問題算法,該算法花費二次的時間和線性空間。舉例來說,假定時間系數是2微秒,空間系數為10字節。求解兩個長1000的字符串,我們要花費2秒和10K字節;求解兩個長10000的字符串,我們花費僅僅增加到3分鐘和100K字節。
String C=c1c2……cp is a subsequence of string A=a1a2……am if and only if there is a mapping F:{1,2,……,p}→{1,2,……,m} such that f(i)=k only if ci is ak and F is a monotone strictly increasing function(i.e. F(i)=u,F(j)=v,and i<j imply that u<v)
字符串C=c1c2……cp是字符串A=a1a2……am的子序列,指的是存在一個映射F:{1,2,……,p}→{1,2,……,m},當f(i)=k,則ci=ak,並且F是一個嚴格單調遞增函數(舉例說明:若F(i)=u,F(j)=v,當i<j 則u<v)
String C is a common subsequence of strings A and B if and only if C is a subsequence of A and C is a subsequence of B
字符串C是字符串A和B的公共子序列,當且僅當C既是A的子序列,同時C又是B的子序列
The problem can be stated as follows : Given strings A=a1a2……am and B=b1b2……bn (over alphabet Σ), find a string C= c1c2……cp such that C is a common subsequence of A and B and p is maximized
求解最長公共子序列問題定義如下:給定字符串A=a1a2……am和B=b1b2……bn(覆蓋字符集Σ),找到一個字符串C=c1c2……cp,C是A和B的公共子序列之中p最大的那個
We call C an example of a maximal common subsequence.
我們又把C稱作最大公共子序列
Notation. For string D=d1d2……dr , Dkt is dkdk+1……dt if k≤t; dkdk-1……dt if k≥1. When k>t , we shall write ~Dkt so as to make clear that we are referring to a “reverse substring” of D
標記。對於字符串D=d1d2……dr,Dkt表示為dkdk+1……dt (k≤t);dkdk-1……dt(k≥1),當k>t時,我們標記為~Dkt,稱為D的“反轉子串”
L(i,j) is the maximum length possible of any common subsequence of A1i and B1j
L(i,j)表示為A1i和B1j的所有可能的公共子序列的長度中最大值。
x||y is the concatenation of strings x and y
x||y表示為字符串x和y的連接。
We present the algorithm described in [3], which takes quadratic time and space
我們提到的算法出自[3],它花費二次時間和空間
Algorithm A
算法A
Algorithm A accepts as input strings A1m and B1n and produces as output the matrix L (where the element L(i,j) corresponds to our notation of maximum length possible of any common subsequence of A1i and B1j)
算法A接受輸入字符串A1m 和B1n,並且計算輸出矩陣L(矩陣元素L(i,j)如標記中所稱,表示為A1i和B1j的所有可能的公共子序列的長度中最大值。)
ALG A(m,n,A,B,L)
1. Initialization: L(i,0)←0 [i=0……m];
L(0,j)←0 [j=0……n];
2. for i←1 to m do
begin
3. for j←1 to n do
if A(i)=B(j) then L(i,j)←L(i-1,j-1)+1
else L(i,j)←max{L(i,j-1),L(i-1,j)}
end
Proof of correctness of Algorithm A