A taxonomy of suffix array construction algorithms

Published: 06 July 2007

Abstract

In 1990, Manber and Myers proposed suffix arrays as a space-saving alternative to suffix trees and described the first algorithms for suffix array construction and use. Since that time, and especially in the last few years, suffix array construction algorithms have proliferated in bewildering abundance. This survey paper attempts to provide simple high-level descriptions of these numerous algorithms that highlight both their distinctive features and their commonalities, while avoiding as much as possible the complexities of implementation details. New hybrid algorithms are also described. We provide comparisons of the algorithms' worst-case time complexity and use of additional space, together with results of recent experimental test runs on many of their implementations.
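For orientation, the following is a minimal sketch of the data structure the abstract refers to. It is not one of the construction algorithms surveyed in the paper: it builds the array by directly sorting suffixes, which is simple but far slower in the worst case than the algorithms the survey compares. The function names `build_suffix_array` and `find_occurrences` are illustrative and do not come from the paper.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes of text in lexicographic order.

    Naive construction by direct suffix comparison (an assumption made for
    brevity, not a method from the survey); worst case O(n^2 log n) time.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Binary-search the suffix array for every suffix that starts with pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # All matches occupy one contiguous block of the suffix array.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    text = "banana"
    sa = build_suffix_array(text)
    print(sa)
    print(find_occurrences(text, sa, "ana"))
```

The array stores only one integer per text position, which is the space saving over suffix trees that the abstract mentions; the search works because all suffixes sharing a prefix are adjacent in sorted order.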

References

  1. Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E. 2004. Replacing suffix trees with suffix arrays. J. Disc. Algor. 2, 1, 53--86.
  2. Apostolico, A. 1985. The myriad virtues of subword trees. In Combinatorial Algorithms on Words. NATO ASI Series F12. Springer-Verlag, Berlin, Germany, 85--96.
  3. Baron, D. and Bresler, Y. 2005. Antisequential suffix sorting for BWT-based data compression. IEEE Trans. Comput. 54, 4 (Apr.), 385--397.
  4. Bentley, J. L. and McIlroy, M. D. 1993. Engineering a sort function. Softw. Pract. Exper. 23, 11, 1249--1265.
  5. Bentley, J. L. and Sedgewick, R. 1997. Fast algorithms for sorting and searching strings. In Proceedings of the 8th Annual ACM-SIAM Symposium on Discrete Algorithms (New Orleans, LA). ACM, New York, 360--369.
  6. Burkhardt, S. and Kärkkäinen, J. 2003. Fast lightweight suffix array construction and checking. In Proceedings of the 14th Annual Symposium CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin, Germany, 55--69.
  7. Burrows, M. and Wheeler, D. J. 1994. A block sorting lossless data compression algorithm. Tech. Rep. 124, Digital Equipment Corporation, Palo Alto, CA.
  8. Crauser, A. and Ferragina, P. 2002. A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1--35.
  9. Farach, M. 1997. Optimal suffix tree construction for large alphabets. In Proceedings of the 38th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society, Los Alamitos, CA, 137--143.
  10. Ferragina, P. and Grossi, R. 1999. The string b-tree: a new data structure for search in external memory and its applications. J. ACM 46, 2, 236--280.
  11. Grossi, R. and Vitter, J. S. 2005. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35, 2, 378--407.
  12. Hart, M. 1997. Project Gutenberg. http://www.gutenberg.net.
  13. Hon, W., Sadakane, K., and Sung, W. 2003. Breaking a time-and-space barrier in constructing full-text indices. In Proceedings of the 44th IEEE Symposium on Foundations of Computer Science (FOCS'03). IEEE Computer Society Press, Los Alamitos, CA, 251--260.
  14. Itoh, H. and Tanaka, H. 1999. An efficient method for in memory construction of suffix arrays. In Proceedings of the 6th Symposium on String Processing and Information Retrieval (Cancun, Mexico). IEEE Computer Society, Los Alamitos, CA, 81--88.
  15. Kärkkäinen, J. and Sanders, P. 2003. Simple linear work suffix array construction. In Proceedings of the 30th International Colloquium Automata, Languages and Programming. Lecture Notes in Computer Science, vol. 2971. Springer-Verlag, Berlin, Germany, 943--955.
  16. Kärkkäinen, J., Sanders, P., and Burkhardt, S. 2006. Linear work suffix array construction. Journal of the ACM 53, 6 (Nov.), 918--936.
  17. Karp, R. M., Miller, R. E., and Rosenberg, A. L. 1972. Rapid identification of repeated patterns in strings, trees and arrays. In Proceedings of the 4th Annual ACM Symposium on Theory of Computing (Denver, CO). ACM, New York, 125--136.
  18. Kasai, T., Lee, G., Arimura, H., Arikawa, S., and Park, K. 2001. Linear-time longest-common-prefix computation in suffix arrays and its applications. In Proceedings of the 12th Annual Symposium (CPM 2001). Lecture Notes in Computer Science, vol. 2089. Springer-Verlag, Berlin, Germany, 181--192.
  19. Khmelev, D. V. 2003. Program suffsort version 0.1.6. http://www.math.toronto.edu/dkhmelev/PROGS/tacu/suffsort-eng.html.
  20. Kim, D. K., Jo, J., and Park, H. 2004. A fast algorithm for constructing suffix arrays for fixed-size alphabets. In Proceedings of the 3rd Workshop on Experimental and Efficient Algorithms (WEA 2004), C. C. Ribeiro and S. L. Martins, Eds. Springer-Verlag, Berlin, Germany, 301--314.
  21. Kim, D. K., Sim, J. S., Park, H., and Park, K. 2003. Linear-time construction of suffix arrays. In Proceedings of the 14th Annual Symposium Combinatorial Pattern Matching, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin, Germany, 186--199.
  22. Kim, D. K., Sim, J. S., Park, H., and Park, K. 2005. Constructing suffix arrays in linear time. J. Discrete Algorithms 3, 126--142.
  23. Ko, P. 2006. Linear time suffix array. http://www.public.iastate.edu/~kopang/progRelease/homepage.html.
  24. Ko, P. and Aluru, S. 2003. Space efficient linear time construction of suffix arrays. In Proceedings of the 14th Annual Symposium CPM 2003, R. Baeza-Yates, E. Chávez, and M. Crochemore, Eds. Lecture Notes in Computer Science, vol. 2676. Springer-Verlag, Berlin, Germany, 200--210.
  25. Ko, P. and Aluru, S. 2005. Space efficient linear time construction of suffix arrays. J. Disc. Algor. 3, 143--156.
  26. Kurtz, S. 1999. Reducing the space requirement of suffix trees. Softw. Pract. Exper. 29, 13, 1149--1171.
  27. Larsson, J. N. and Sadakane, K. 1999. Faster suffix sorting. Tech. Rep. LU-CS-TR:99-214 {LUNFD6/(NFCS-3140)/1-20/(1999)}, Department of Computer Science, Lund University, Sweden.
  28. Lee, S. and Park, K. 2004. Efficient implementations of suffix array construction algorithms. In AWOCA 2004: Proceedings of the 15th Australasian Workshop on Combinatorial Algorithms, S. Hong, Ed. 64--72.
  29. Malyshev, D. 2006. DARK the universal archiver based on BWT-DC scheme. http://darchiver.narod.ru/.
  30. Manber, U. and Myers, G. W. 1990. Suffix arrays: A new method for on-line string searches. In Proceedings of the 1st ACM-SIAM Symposium on Discrete Algorithms. ACM, New York, 319--327.
  31. Manber, U. and Myers, G. W. 1993. Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22, 5, 935--948.
  32. Maniscalco, M. A. 2005. MSufSort. http://www.michael-maniscalco.com/msufsort.htm.
  33. Maniscalco, M. A. and Puglisi, S. J. 2006. Faster lightweight suffix array construction. In Proceedings of 17th Australasian Workshop on Combinatorial Algorithms, J. Ryan and Dafik, Eds. Univ. Ballarat, Ballarat, Victoria, Australia, 16--29.
  34. Maniscalco, M. A. and Puglisi, S. J. 2007. An efficient, versatile approach to suffix sorting. ACM J. Experiment. Algor. To appear.
  35. Manzini, G. 2004. Two space saving tricks for linear time LCP computation. In Proceedings of 9th Scandinavian Workshop on Algorithm Theory (SWAT '04), T. Hagerup and J. Katajainen, Eds. Lecture Notes in Computer Science, vol. 3111. Springer-Verlag, Berlin, Germany, 372--383.
  36. Manzini, G. and Ferragina, P. 2004. Engineering a lightweight suffix array construction algorithm. Algorithmica 40, 33--50.
  37. McIlroy, M. D. 1997. ssort.c. http://cm.bell-labs.com/cm/cs/who/doug/source.html.
  38. McIlroy, P. M., Bostic, K., and McIlroy, M. D. 1993. Engineering radix sort. Comput. Syst. 6, 1, 5--27.
  39. Mori, Y. 2006. DivSufSort. http://www.homepage3.nifty.com/wpage/software/libdivsufsort.html.
  40. Munro, J. I. 1996. Tables. In Proceedings of the 16th Conference on Foundations of Software Technology and Theoretical Computer Science (FSTTCS). Lecture Notes in Computer Science, vol. 1180. Springer-Verlag, London, UK, 37--42.
  41. Na, J. C. 2005. Linear-time construction of compressed suffix arrays using O(n log n)-bit working space for large alphabets. In Proceedings of the 16th Annual Symposium Combinatorial Pattern Matching, A. Apostolico, M. Crochemore, and K. Park, Eds. Lecture Notes in Computer Science, vol. 3537. Springer-Verlag, Berlin, Germany, 57--67.
  42. Navarro, G. and Mäkinen, V. 2007. Compressed full-text indexes. ACM Comput. Surv. 39, 1 (Apr.), Article 2.
  43. Puglisi, S. J., Smyth, W. F., and Turpin, A. H. 2005. The performance of linear time suffix sorting algorithms. In Proceedings of the IEEE Data Compression Conference, M. Cohn and J. Storer, Eds. IEEE Computer Society Press, Los Alamitos, CA, 358--368.
  44. Sadakane, K. 1998. A fast algorithm for making suffix arrays and for Burrows-Wheeler transformation. In DCC: Data Compression Conference. IEEE Computer Society Press, Los Alamitos, CA, 129--138.
  45. Schürmann, K. and Stoye, J. 2005. An incomplex algorithm for fast suffix array construction. In Proceedings of the 7th Workshop on Algorithm Engineering and Experiments (ALENEX05). SIAM, 77--85.
  46. Seward, J. 2000. On the performance of BWT sorting algorithms. In DCC: Data Compression Conference. IEEE Computer Society Press, Los Alamitos, CA, 173--182.
  47. Sim, J. S., Kim, D. K., Park, H., and Park, K. 2003. Linear-time search in suffix arrays. In Proceedings of the 14th Australasian Workshop on Combinatorial Algorithms, M. Miller and K. Park, Eds. (Seoul, Korea), 139--146.
  48. Sinha, R. and Zobel, J. 2004. Cache-conscious sorting of large sets of strings with dynamic tries. ACM J. Exper. Algor. 9.
  49. Smyth, B. 2003. Computing Patterns in Strings. Pearson Addison-Wesley, Essex, England.

Reviews

Neil D Burgess

This paper summarizes the existing body of knowledge in the rather esoteric area of suffix arrays. While it is quite demanding reading, it makes a useful contribution by publishing the results of a series of measured executions. Suffix arrays are used for efficient searches of large text files. They were originally introduced by Manber and Myers in 1990 (the journal version appeared in 1993), so there is a considerable body of knowledge to draw upon. In addition to a detailed academic discussion of the various algorithms, Puglisi et al. guide the reader as to which algorithm would be suitable for various implementation environments; tables of experimental results are provided. Near the end of the paper, some useful conclusions are presented. The existing body of knowledge on which this paper is based is detailed clearly. An academic style of referencing is used, and the authors have conducted a detailed investigation of a reasonable sample of the available algorithms. In conclusion, the paper is recommended for people who need to search large text files on a regular basis and who are responsible for selecting the most suitable algorithm. The information in the paper is particularly valuable for use in a resource-constrained environment.

Online Computing Reviews Service
