Longest common subsequence problem
Encyclopedia : L : LO : LON : Longest common subsequence problem
The longest-common subsequence problem is the search for an efficient method of finding the longest common subsequence (LCS). It is a classic computer science problem and the basis of diff. More recently, it has gained prominence thanks to its applications in bioinformatics.
It should not be confused with the longest common substring problem: a substring is contiguous. There are at most k*(k+1)/2 + 1 substrings of a string X of length k. A subsequence is not necessarily contiguous. There are at most choose(n,k) subsequences of a string X of length k.
A naive method of searching for LCS would be to employ a brute force policy: Given a sequence X, determine all possible subsequences of X, and check to see if each subsequence was a subsequence of Y, keeping track of the longest subsequence found. Each subsequence of X would be in the set of . Basic number theory results demonstrate that there are O(2k) subsequences of X, making this algorithm exponential time and therefore impracticably inefficient for long sequences, such as human DNA strands.
Polynomial-time algorithms (e.g., based on dynamic programming) are known for the variant of the problem where the number of input sequences is fixed. The problem is NP-hard for the general case of arbitrary (i.e., not a priori fixed) number of input sequences even in the case of the binary alphabet. D. Maier, "The complexity of some problems on subsequences and supersequences", J. ACM 25 (1978) 322–336.
Four steps to LCS, polynomial time edition
Analyze LCS properties
Many computer scientists have written papers on LCS properties, including one where LCS has an optimal-substructure property.The Optimal-Substructure of an LCS Theorem is
Let X = < x1,...,xm > and Y = < y1,...,yn > be sequences, and let Z = < z1,...,zk > be any LCS of X and Y.
- If xm = yn, then zk = xm = yn and Zk-1 is an LCS of Xm-1 and Yn-1.
- If xm ≠ yn and zk ≠ xm, then Z is an LCS of Xm-1 and Y.
- If xm ≠ yn and zk ≠ yn, then Z is an LCS of X and Yn-1.
Devise a
Computer scientists have agreed on the formula:
- [ c[i,j] = \left\ 0, & \mbox \\ c[i-1,j-1]+1, & \mbox>\mboxx_i = y_j \\ max(c[i,j-1],c[i-1,j]) & \mbox>\mboxx_i \ne y_j \end\right. ]
The recursive formula relies on the Optimal Substructure Theorem in Part 1. It involves establishing a recurrence for the optimal value. c[i,j] is the length of the LCS in sequences Xi and Yj. The first part of the formula says if either one of the sequences has length 0, then there can be no proper subsequence of a null sequence. The other two parts breaks the LCS apart into smaller subproblems until we reach the null sequence.
Compute the LCS
- Several algorithms are known; some involve exponential time functions, others are in polynomial time. One such polynomial function is the Generic LCS Delta algorithm.
LCS-Delta(X,Y) m <- LENGTH[X] n <- LENGTH[Y] for i <- 1 to m, do c[i,0] <-0; for j <- 0 to n, do c[0,j] <-0; b <- c; for i <- 1 to m, do else if (c[i-1,j] >= c[i,j-1]) else } } return c and b
- LCS Delta is an algorithm that takes two sequences X and Y as inputs. The algorithm creates a table c and with the lengths of X and Y. The algorithm also has a table b, which is a copy of c, which is used to store the optimal solution of the LCS. The rest of the algorithm uses the formula in Step 2 to compute the LCS, and populate the table. The algorithm runs in O(mn) time, where m and n are the length of X and Y respectively. There are more efficient algorithms, namely LCS Alpha and LCS Beta, but this algorithm is the most intuitive.
Construct the LCS
- The b-table returned by the algorithm is used to construct the LCS. The b-table could look something like this:
- [\begin ! & 0 & 0 & 0 & 0 & 0\\ 0 & UP-LEFT & LEFT & 0 & 0 & 0 \\ 0 & 0 & 0 & UP-LEFT & 0 & 0\\ 0 & 0 & 0 & 0 & UP-LEFT & 0 \\ 0 & 0 & 0 & 0 & UP & LEFT \\\end]
Modern LCS search methods have yielded algorithms which have cut down the exponential time to polynomial time. The LCS continues to be examined by computer scientists, trying to find an even faster time, perhaps one in logarithmic-linear time.
Finding the List of Edits (Diff algorithm)
The following algorithm will give the list of edits needed to make Y equal to X (the diff algorithm):Notation: [n,m] refers to a LCS component at X[n] and Y[n]
Assumptions:
The LCS as found by findLCS() is placed on a stack moving from the lower-right corner and following the direction arrows to form the LCS.
oldElem <---- [-1,-1] while stack is not empty elem <---- pop from stack // diff does not require LCS components but add to list here if desired for i=oldElem[n] to elem[n] add delete X(i) from list of edits for i=oldElem[m] to elem[m] add insert Y(i) to list of editsNote: Diff(1) will treat each "gap" between LCS components as a "hunk"
References
- A421: SR10, pg.228.
External links
- [Dictionary of Algorithms and Data Structures: longest common subsequence]
- [GPL implementation of the LCS-Delta algorithm]
- Project Dedupe http://dedupe.sourceforge.net
From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.
