Opentopia Directory Encyclopedia Tools

Suffix array

Encyclopedia : S : SU : SUF : Suffix array



 

In computer science, a suffix array is an array giving the suffixes of a string in lexicographical order.

Details

Consider the string "abracadabra", of length 11. It has eleven suffixes: "abracadabra", "bracadabra", "racadabra", and so on down to "a". Sorted into lexicographical order, these suffixes are

a
abra
abracadabra
acadabra
adabra
bra
bracadabra
cadabra
dabra
ra
racadabra
If the original string is available, each suffix can be completely specified by the zero-based or one-based index of its first character. The suffix array is the array of these indices in lexicographical order. For the string "abracadabra", using one-based indexing, the suffix array is , because the suffix "a" begins at the 11th character, "abra" begins at the 8th character, and so forth.

One may also treat the string "abracadabra" as having a twelfth suffix, the empty string (with index 12). But since the empty string will always sort lexicographically before every other suffix, we lose no information by omitting it.

Algorithms

The easiest way to construct a suffix array is to use an efficient comparison sort algorithm. This requires [O(n \log n)] suffix comparisons, but a suffix comparison requires [O(n)] time, so the overall runtime of this approach is [O(n^2 \log n)]. More sophisticated algorithms improve this to [O(n \log n)] by exploiting the results of partial sorts to avoid redundant comparisons. Several [\Theta(n)] algorithms are known, but are rarely used in practice because of complexity, high memory requirements, and inefficiency (high constant factors).

Applications

The suffix array of a string can be used as an index to quickly locate every occurrence of a substring within the string. Finding every occurrence of the substring is equivalent to finding every suffix that begins with the substring. Thanks to the lexicographical ordering, these suffixes will be grouped together in the suffix array, and can be found efficiently with a binary search. If implemented straightforwardly, this binary search takes [O(m \log n)] time, where [m] is the length of the substring. To avoid redoing comparisons, extra data structures giving information about the longest common prefixes (LCPs) of suffixes are constructed, giving [O(m + \log n)] search time.

Suffix sorting algorithms can be used to perform the Burrows-Wheeler transform (BWT). Technically the BWT requires sorting cyclic permutations of a string, not suffixes. We can fix this by appending to the string a special end-of-string character which sorts lexicographically before every other character. Sorting cyclic permutations is then equivalent to sorting suffixes.

History

Suffix arrays were originally developed by Gene Myers and Udi Manber to reduce memory consumption compared to a suffix tree. This began the trend towards compressed suffix arrays and BWT-based compressed full-text indices.

References

 


From Wikipedia, the Free Encyclopedia. Original article here. Support Wikipedia by contributing or donating.
All text is available under the terms of the GNU Free Documentation License See Wikipedia Copyrights for details.


Search Titles
0123456789
ABCDEFGHIJ
KLMNOPQRST
UVWXYZ?

E-mail this article to:

Personal Message: